1 Introduction

The dataset used in this project contains the following information about 205 cars:

  • symboling (risk level): categorical variable -3, -2, -1, 0, 1, 2, 3.
  • make (Brand of the car): categorical variable (alfa-romeo, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugeot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo)
  • fuel_type: binary variable (diesel, gas)
  • aspiration (engine’s aspiration): binary variable (std, turbo)
  • num_of_doors (number of doors): binary variable (four, two)
  • body_style (type of car): categorical variable (hardtop, wagon, sedan, hatchback, convertible)
  • drive_wheels (drive wheels): categorical variable (4wd, fwd, rwd)
  • engine_location: binary variable (front, rear).
  • wheel_base (distance between the wheels): numeric variable which values are between 86.6 and 120.9.
  • length: numeric variable which values are between 141.1 and 208.1.
  • width: numeric variable which values are between 60.3 and 72.3.
  • height: numeric variable which values are between 47.8 and 59.8.
  • curb_weight: numeric variable which values are between 1488 and 4066.
  • engine_type: categorical variable (dohc, dohcv, l, ohc, ohcf, ohcv, rotor).
  • num_of_cylinders (number of cylinders): categorical variable (eight, five, four, six, three, twelve, two).
  • engine_size: numeric variable which values are between 61 and 326.
  • fuel_system: categorical variable (1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi).
  • bore (engine’s diameter): numeric variable which values are between 2.54 and 3.94.
  • stroke: numeric variable which values are between 2.07 and 4.17.
  • compression_ratio: numeric variable which values are between 7 and 23.
  • horsepower: numeric variable which values are between 48 and 288.
  • peak_rpm (maximum revolutions per minute): numeric variable which values are between 4150 and 6600.
  • city_mpg: numeric variable which values are between 13 and 49.
  • highway_mpg: numeric variable which values are between 16 and 54.
  • price: numeric variable which values are between 5118 and 45400.

All the variables are meassured in the United States’ units such as miles, dollars, inches, …

skimr::skim(data)
Data summary
Name data
Number of rows 205
Number of columns 25
_______________________
Column type frequency:
factor 11
numeric 14
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
symboling 0 1.00 FALSE 6 0: 67, 1: 54, 2: 32, 3: 27
make 0 1.00 FALSE 22 toy: 32, nis: 18, maz: 17, hon: 13
fuel_type 0 1.00 FALSE 2 gas: 185, die: 20
aspiration 0 1.00 FALSE 2 std: 168, tur: 37
num_of_doors 2 0.99 FALSE 2 fou: 114, two: 89
body_style 0 1.00 FALSE 5 sed: 96, hat: 70, wag: 25, har: 8
drive_wheels 0 1.00 FALSE 3 fwd: 120, rwd: 76, 4wd: 9
engine_location 0 1.00 FALSE 2 fro: 202, rea: 3
engine_type 0 1.00 FALSE 7 ohc: 148, ohc: 15, ohc: 13, doh: 12
num_of_cylinders 0 1.00 FALSE 7 fou: 159, six: 24, fiv: 11, eig: 5
fuel_system 0 1.00 FALSE 8 mpf: 94, 2bb: 66, idi: 20, 1bb: 11

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
wheel_base 0 1.00 98.76 6.02 86.60 94.50 97.00 102.40 120.90 ▁▇▂▁▁
length 0 1.00 174.05 12.34 141.10 166.30 173.20 183.10 208.10 ▁▅▇▃▁
width 0 1.00 65.91 2.15 60.30 64.10 65.50 66.90 72.30 ▁▇▇▂▁
height 0 1.00 53.72 2.44 47.80 52.00 54.10 55.50 59.80 ▁▆▇▆▂
curb_weight 0 1.00 2555.57 520.68 1488.00 2145.00 2414.00 2935.00 4066.00 ▃▇▅▃▁
engine_size 0 1.00 126.91 41.64 61.00 97.00 120.00 141.00 326.00 ▇▆▂▁▁
bore 4 0.98 3.33 0.27 2.54 3.15 3.31 3.59 3.94 ▁▅▇▇▂
stroke 4 0.98 3.26 0.32 2.07 3.11 3.29 3.41 4.17 ▁▂▇▇▁
compressio_ratio 0 1.00 10.14 3.97 7.00 8.60 9.00 9.40 23.00 ▇▁▁▁▁
horsepower 2 0.99 104.26 39.71 48.00 70.00 95.00 116.00 288.00 ▇▅▂▁▁
peak_rpm 2 0.99 5125.37 479.33 4150.00 4800.00 5200.00 5500.00 6600.00 ▂▇▇▂▁
city_mpg 0 1.00 17.12 0.68 17.00 17.00 17.00 17.00 24.00 ▇▁▁▁▁
highway_mpg 0 1.00 30.75 6.89 16.00 25.00 30.00 34.00 54.00 ▂▇▆▁▁
price 4 0.98 13207.13 7947.07 5118.00 7775.00 10295.00 16500.00 45400.00 ▇▃▁▁▁

In the previous summary of the data, we can see a big difference in the values from the numerical variables as the data is not normalized. Therefore, we can expect the variances to be quite different just because of the units of measurements. This has to be taken into consideration before carrying out any dimension reduction analysis.

Moreover, there are a few missing values in the following variables: num_of_doors, bore, stroke, horsepower, peak_rpm and price. As this absence may difficult the study, we will perform an imputation based on predictive mean matching for these variables.

The predictive mean matching will complete the observations using a linear regression of the missing value using the rest of the variables. The estimation of the predictors is accomplished with the covariance matrix.

\[ x_j = \beta_0 + \beta_1 * x_1 + ... + \beta_p * x_p + \epsilon_j \]

data_imp = mice(data,method="pmm", seed = 100407485)
## 
##  iter imp variable
##   1   1  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   1   2  num_of_doors  bore*  stroke*  horsepower*  peak_rpm*  price*
##   1   3  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   1   4  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   1   5  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   2   1  num_of_doors*  bore*  stroke*  horsepower  peak_rpm*  price*
##   2   2  num_of_doors*  bore*  stroke  horsepower*  peak_rpm*  price*
##   2   3  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   2   4  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   2   5  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   3   1  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   3   2  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   3   3  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   3   4  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   3   5  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   4   1  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   4   2  num_of_doors*  bore*  stroke  horsepower*  peak_rpm*  price*
##   4   3  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   4   4  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   4   5  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   5   1  num_of_doors*  bore  stroke*  horsepower*  peak_rpm*  price*
##   5   2  num_of_doors*  bore*  stroke  horsepower*  peak_rpm*  price*
##   5   3  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   5   4  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
##   5   5  num_of_doors*  bore*  stroke*  horsepower*  peak_rpm*  price*
data_imp = complete(data_imp)
data <- data_imp

2 Data anlaysis

The main characteristics of the quantitative variables are:

# Mean vector
m = colMeans(data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price")], na.rm = T)

\[ \overline{x} = (98.76, 174.05, 65.91, 53.72, 2555.57, 126.91, 3.33 ) \]

That is to say that mean measurements of the cars that are in this data set are:

  • Wheel base: 98.76 inches.
  • Length: 174.05 inches.
  • Width: 65.91 inches.
  • Height: 53.72 inches.
  • Curb weight: 2555.57 pounds.

This cars also have an engine size with a mean of 126.91 cubic inches, while the mean of the bore and stroke is of 3.33 and 3.26 inches respectively. The mean of the compression ratio is 10.14. In addition, the vehicles have mean of the horsepower of 104.26, with the mean of the peak revolutions per minute of 5125.37 revolutions, and the mean consumption in the city and in the highway are 17.12 and 30.75 miles per gallon each.

These characteristics make the cars' value to be around 13207.13$.

Before, calculating the covariance and the correlation matrixes, we will eliminate the city_mpg variable as it is almost constant and it interferes with the study.

data <- data[,!names(data) %in% c("city_mpg")]
# Covariance matrix
S = var(data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")], na.rm = T)
round(S, 2)
##                  wheel_base   length    width  height curb_weight engine_size
## wheel_base            36.26    64.98    10.27    8.67     2434.30      142.77
## length                64.98   152.21    22.26   14.80     5638.34      351.08
## width                 10.27    22.26     4.60    1.46      968.45       65.70
## height                 8.67    14.80     1.46    5.97      376.05        6.83
## curb_weight         2434.30  5638.34   968.45  376.05   271107.87    18443.03
## engine_size          142.77   351.08    65.70    6.83    18443.03     1734.11
## bore                   0.83     2.08     0.33    0.16       92.90        7.15
## stroke                 0.35     0.56     0.13    0.01       29.30        3.31
## compressio_ratio       5.97     7.76     1.54    2.54      313.04        4.79
## horsepower            85.17   267.56    54.10  -10.22    15470.60     1331.47
## peak_rpm           -1010.63 -1737.84  -231.10 -365.30   -65787.45    -4907.10
## highway_mpg          -22.56   -59.87   -10.00   -1.81    -2859.42     -194.28
## price              27862.47 67150.39 12163.95 2840.08  3358912.33   281913.60
##                     bore stroke compressio_ratio horsepower   peak_rpm
## wheel_base          0.83   0.35             5.97      85.17   -1010.63
## length              2.08   0.56             7.76     267.56   -1737.84
## width               0.33   0.13             1.54      54.10    -231.10
## height              0.16   0.01             2.54     -10.22    -365.30
## curb_weight        92.90  29.30           313.04   15470.60  -65787.45
## engine_size         7.15   3.31             4.79    1331.47   -4907.10
## bore                0.08   0.00             0.01       6.01     -43.43
## stroke              0.00   0.11             0.24       0.68     -26.83
## compressio_ratio    0.01   0.24            15.78     -31.53    -812.12
## horsepower          6.01   0.68           -31.53    1590.39    2982.54
## peak_rpm          -43.43 -26.83          -812.12    2982.54  236744.98
## highway_mpg        -1.01   0.00             7.25    -209.93    -180.60
## price            1138.46 211.68          2354.57  231850.01 -392664.55
##                  highway_mpg       price
## wheel_base            -22.56    27862.47
## length                -59.87    67150.39
## width                 -10.00    12163.95
## height                 -1.81     2840.08
## curb_weight         -2859.42  3358912.33
## engine_size          -194.28   281913.60
## bore                   -1.01     1138.46
## stroke                  0.00      211.68
## compressio_ratio        7.25     2354.57
## horsepower           -209.93   231850.01
## peak_rpm             -180.60  -392664.55
## highway_mpg            47.42   -37747.13
## price              -37747.13 62651214.04

As it was expected, the values from the variances differ a lot between each variable due to the measurements. The variable with the highest variance is the price.

# Correlation matrix
R = cor(data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")], use="complete.obs")
round(R, 2)
##                  wheel_base length width height curb_weight engine_size  bore
## wheel_base             1.00   0.87  0.80   0.59        0.78        0.57  0.49
## length                 0.87   1.00  0.84   0.49        0.88        0.68  0.60
## width                  0.80   0.84  1.00   0.28        0.87        0.74  0.54
## height                 0.59   0.49  0.28   1.00        0.30        0.07  0.23
## curb_weight            0.78   0.88  0.87   0.30        1.00        0.85  0.63
## engine_size            0.57   0.68  0.74   0.07        0.85        1.00  0.61
## bore                   0.49   0.60  0.54   0.23        0.63        0.61  1.00
## stroke                 0.18   0.14  0.18   0.01        0.17        0.24  0.02
## compressio_ratio       0.25   0.16  0.18   0.26        0.15        0.03  0.01
## horsepower             0.35   0.54  0.63  -0.10        0.75        0.80  0.54
## peak_rpm              -0.34  -0.29 -0.22  -0.31       -0.26       -0.24 -0.32
## highway_mpg           -0.54  -0.70 -0.68  -0.11       -0.80       -0.68 -0.52
## price                  0.58   0.69  0.72   0.15        0.82        0.86  0.51
##                  stroke compressio_ratio horsepower peak_rpm highway_mpg price
## wheel_base         0.18             0.25       0.35    -0.34       -0.54  0.58
## length             0.14             0.16       0.54    -0.29       -0.70  0.69
## width              0.18             0.18       0.63    -0.22       -0.68  0.72
## height             0.01             0.26      -0.10    -0.31       -0.11  0.15
## curb_weight        0.17             0.15       0.75    -0.26       -0.80  0.82
## engine_size        0.24             0.03       0.80    -0.24       -0.68  0.86
## bore               0.02             0.01       0.54    -0.32       -0.52  0.51
## stroke             1.00             0.19       0.05    -0.17        0.00  0.08
## compressio_ratio   0.19             1.00      -0.20    -0.42        0.27  0.07
## horsepower         0.05            -0.20       1.00     0.15       -0.76  0.73
## peak_rpm          -0.17            -0.42       0.15     1.00       -0.05 -0.10
## highway_mpg        0.00             0.27      -0.76    -0.05        1.00 -0.69
## price              0.08             0.07       0.73    -0.10       -0.69  1.00
corrplot(R,order="hclust")

These correlation matrix show a positive linear relation between the variables that indicate the dimension's of the car and the ones related to the power of the engine. In addition, these two groups are correlated to the price positively. Therefore, we can say that the most expensive cars will be the bigger ones with a powerful engine. On the other hand, we can see that the bigger and more expensive the car, the less miles per gallon (they consume more).

A scatterplot matrix with every numeric variable is shown, in order to carry out the first graphical analysis.

ggpairs(data,         
        columns = c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price"), 
        upper = list(continuous = wrap("cor", size = 1.5)),
        lower = list(continuous = wrap("points", size = 0.5)))  + 
  theme(strip.text.x = element_text(size = 4),
        strip.text.y = element_text(size = 2),
        axis.text.x = element_text(size = 2),
        axis.text.y = element_text(size = 2))

In the plot, it can be seen that the length and the peak_rpm of the car are the only variables which are not heavily skewed. All the other variables are right skewed (the mean is larger than the median), except for the height, the bore and the stroke which are left skewed (the mean is smaller than the median).

Moreover, a correlation analysis can be carried out with this plot:

  • Positive correlation between the wheel_base, the length, the width, the curb_weight and the engine_size of the car: The vehicles with a large wheel base will be the bigger ones as they will also be longer, wider and heavier.

  • Positive correlation between the engine_size, the horsepower and the price of the car: the bigger cars are usually the ones with a the most powerful engines. Therefore, the price are also higher.

  • The price has a positive correlation with the width, the curb_weight, the engine_size and the horsepower but a negative correlation with the highway_mpg: The bigger cars are the ones that consume the most fuel int he highway.

2.1 Group differenciation

2.1.1 Fuel_type

The main characteristics for each group are calculated in order to get a first idea of the differences these groups may have.

We have to take into account that there is a big difference in the sample sizes for each group: there are only 20 cars fueled by diesel while there are 185 cars powered by gas. Therefore, the conclusions may not be reliable.

# Mean vector
m_fuel = aggregate(x = data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")],               
          by = list(data$`fuel_type`),              
          FUN = mean)

m_fuel

After comparing both mean vectors, we can say that the cars fueled by gas are usually smaller but more powerful and environmentally friendly. However, the engine size does not seem different at first, we would have to carry out an hypothesis test to check the existance of this difference. Moreover, the gas cars have are cheaper if we look at the mean, which may be caused due to the size as they are smaller.

# Covariance matrix
data_num = data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")]
data_gas = data_num[data$`fuel_type`=="gas",]
data_diesel = data_num[data$`fuel_type`=="diesel",]

(S_diesel = cov(data_diesel))
##                     wheel_base        length         width        height
## wheel_base          47.1893684    76.5011579    13.7350526    8.40089474
## length              76.5011579   131.4598947    23.5165263   14.73647368
## width               13.7350526    23.5165263     5.0685263    2.31100000
## height               8.4008947    14.7364737     2.3110000    2.69397368
## curb_weight       3821.7978947  6316.3231579  1261.6957895  674.39157895
## engine_size        188.4178947   303.8400000    66.3094737   28.79000000
## bore                 1.4819421     2.3142684     0.4227000    0.26399211
## stroke               0.5268474     0.8078895     0.1820368    0.08627105
## compressio_ratio    -3.9831053    -6.1011053    -1.0485789   -0.78071053
## horsepower         154.3257895   254.9057895    56.3489474   24.38868421
## peak_rpm         -1108.5789474 -1632.4736842  -272.4210526 -188.13157895
## highway_mpg        -53.3868421   -87.5184211   -17.2184211   -8.52763158
## price            41364.9278947 67938.1405263 16108.7478947 5258.03131579
##                    curb_weight   engine_size          bore       stroke
## wheel_base          3821.79789    188.417895    1.48194211   0.52684737
## length              6316.32316    303.840000    2.31426842   0.80788947
## width               1261.69579     66.309474    0.42270000   0.18203684
## height               674.39158     28.790000    0.26399211   0.08627105
## curb_weight       342676.80000  17824.831579  129.72042105  49.33852632
## engine_size        17824.83158   1031.378947    7.30352632   2.97278947
## bore                 129.72042      7.303526    0.08141553   0.02076342
## stroke                49.33853      2.972789    0.02076342   0.01191026
## compressio_ratio    -342.49895    -19.287895   -0.20886579  -0.06197632
## horsepower         14554.20000    797.878947    4.95313158   2.20581579
## peak_rpm          -91711.05263  -4667.894737  -60.56052632 -15.54473684
## highway_mpg        -4662.00000   -237.078947   -1.68513158  -0.57381579
## price            3990844.92632 224512.205263 1173.80507895 690.81807895
##                  compressio_ratio    horsepower      peak_rpm   highway_mpg
## wheel_base          -3.983105e+00    154.325789   -1108.57895 -5.338684e+01
## length              -6.101105e+00    254.905789   -1632.47368 -8.751842e+01
## width               -1.048579e+00     56.348947    -272.42105 -1.721842e+01
## height              -7.807105e-01     24.388684    -188.13158 -8.527632e+00
## curb_weight         -3.424989e+02  14554.200000  -91711.05263 -4.662000e+03
## engine_size         -1.928789e+01    797.878947   -4667.89474 -2.370789e+02
## bore                -2.088658e-01      4.953132     -60.56053 -1.685132e+00
## stroke              -6.197632e-02      2.205816     -15.54474 -5.738158e-01
## compressio_ratio     6.447105e-01    -13.244474     159.55263  4.153947e+00
## horsepower          -1.324447e+01    673.839474   -3303.42105 -2.025658e+02
## peak_rpm             1.595526e+02  -3303.421053   57394.73684  1.227632e+03
## highway_mpg          4.153947e+00   -202.565789    1227.63158  7.440789e+01
## price               -2.933590e+03 188064.771053 -756516.05263 -5.394343e+04
##                          price
## wheel_base          41364.9279
## length              67938.1405
## width               16108.7479
## height               5258.0313
## curb_weight       3990844.9263
## engine_size        224512.2053
## bore                 1173.8051
## stroke                690.8181
## compressio_ratio    -2933.5903
## horsepower         188064.7711
## peak_rpm          -756516.0526
## highway_mpg        -53943.4342
## price            60215174.4500
(S_gas =  cov(data_gas))
##                     wheel_base        length         width        height
## wheel_base          31.5080511    58.7365755  8.936954e+00    7.31662250
## length              58.7365755   147.5453596  2.079283e+01   12.86681081
## width                8.9369536    20.7928305  4.299642e+00    0.99715188
## height               7.3166225    12.8668108  9.971519e-01    5.80531551
## curb_weight       2071.3590041  5269.8623090  8.804974e+02  260.05584606
## engine_size        132.8623942   349.4370065  6.437960e+01    2.36779377
## bore                 0.7335082     2.0135600  3.088380e-01    0.13221202
## stroke               0.1645449     0.3075881  7.604377e-02   -0.06227529
## compressio_ratio    -1.0137117    -2.1374933 -3.554993e-01   -0.12345073
## horsepower          91.5720917   288.8049060  5.769899e+01   -8.94572562
## peak_rpm          -544.0339307 -1105.3376910 -1.032645e+02 -212.53437133
## highway_mpg        -22.2152526   -61.1724471 -1.004702e+01   -2.13863396
## price            24752.8842685 64796.0629994  1.131831e+04 1906.61326381
##                    curb_weight   engine_size          bore       stroke
## wheel_base          2071.35900    132.862394  7.335082e-01  0.164544888
## length              5269.86231    349.437006  2.013560e+00  0.307588102
## width                880.49741     64.379598  3.088380e-01  0.076043772
## height               260.05585      2.367794  1.322120e-01 -0.062275294
## curb_weight       251001.24971  18243.596063  8.732101e+01 17.382958284
## engine_size        18243.59606   1806.791598  7.113787e+00  3.105043184
## bore                  87.32101      7.113787  7.893871e-02 -0.002003211
## stroke                17.38296      3.105043 -2.003211e-03  0.108816816
## compressio_ratio    -107.97679     -5.258311 -4.244221e-02 -0.072542092
## horsepower         16446.23414   1414.220035  6.283735e+00  1.089693008
## peak_rpm          -35320.67421  -4237.414806 -3.736937e+01 -8.290085194
## highway_mpg        -2854.13646   -195.149177 -9.762920e-01 -0.058898649
## price            3198176.68346 286459.443155  1.122707e+03 83.138801410
##                  compressio_ratio    horsepower      peak_rpm   highway_mpg
## wheel_base          -1.013712e+00     91.572092 -5.440339e+02 -2.221525e+01
## length              -2.137493e+00    288.804906 -1.105338e+03 -6.117245e+01
## width               -3.554993e-01     57.698986 -1.032645e+02 -1.004702e+01
## height              -1.234507e-01     -8.945726 -2.125344e+02 -2.138634e+00
## curb_weight         -1.079768e+02  16446.234136 -3.532067e+04 -2.854136e+03
## engine_size         -5.258311e+00   1414.220035 -4.237415e+03 -1.951492e+02
## bore                -4.244221e-02      6.283735 -3.736937e+01 -9.762920e-01
## stroke              -7.254209e-02      1.089693 -8.290085e+00 -5.889865e-02
## compressio_ratio     4.764337e-01     -6.044762  5.592778e+01  1.900237e+00
## horsepower          -6.044762e+00   1648.915100  2.067017e+03 -2.025480e+02
## peak_rpm             5.592778e+01   2067.016745  2.007171e+05  9.341951e-01
## highway_mpg          1.900237e+00   -202.548032  9.341951e-01  4.296839e+01
## price               -1.017411e+03 244019.086222 -1.316141e+05 -3.760490e+04
##                          price
## wheel_base          24752.8843
## length              64796.0630
## width               11318.3139
## height               1906.6133
## curb_weight       3198176.6835
## engine_size        286459.4432
## bore                 1122.7068
## stroke                 83.1388
## compressio_ratio    -1017.4113
## horsepower         244019.0862
## peak_rpm          -131614.0937
## highway_mpg        -37604.8994
## price            62331611.9750
# Correlation matrix

(R_diesel = cor(data_diesel))
##                  wheel_base     length      width     height curb_weight
## wheel_base        1.0000000  0.9712908  0.8881113  0.7450867   0.9503931
## length            0.9712908  1.0000000  0.9110363  0.7830694   0.9410774
## width             0.8881113  0.9110363  1.0000000  0.6254065   0.9573516
## height            0.7450867  0.7830694  0.6254065  1.0000000   0.7018966
## curb_weight       0.9503931  0.9410774  0.9573516  0.7018966   1.0000000
## engine_size       0.8540650  0.8251619  0.9171188  0.5461801   0.9481432
## bore              0.7560587  0.7073974  0.6580180  0.5636904   0.7766268
## stroke            0.7027519  0.6456462  0.7408969  0.4816236   0.7722949
## compressio_ratio -0.7221332 -0.6627191 -0.5800667 -0.5923944  -0.7286761
## horsepower        0.8654425  0.8564556  0.9641991  0.5724183   0.9577844
## peak_rpm         -0.6736100 -0.5943107 -0.5050842 -0.4784417  -0.6539481
## highway_mpg      -0.9009540 -0.8848982 -0.8866311 -0.6023130  -0.9232523
## price             0.7759919  0.7635967  0.9220786  0.4128322   0.8785562
##                  engine_size       bore     stroke compressio_ratio horsepower
## wheel_base         0.8540650  0.7560587  0.7027519       -0.7221332  0.8654425
## length             0.8251619  0.7073974  0.6456462       -0.6627191  0.8564556
## width              0.9171188  0.6580180  0.7408969       -0.5800667  0.9641991
## height             0.5461801  0.5636904  0.4816236       -0.5923944  0.5724183
## curb_weight        0.9481432  0.7766268  0.7722949       -0.7286761  0.9577844
## engine_size        1.0000000  0.7970212  0.8481923       -0.7479857  0.9570831
## bore               0.7970212  1.0000000  0.6667831       -0.9116567  0.6687257
## stroke             0.8481923  0.6667831  1.0000000       -0.7072659  0.7786286
## compressio_ratio  -0.7479857 -0.9116567 -0.7072659        1.0000000 -0.6354393
## horsepower         0.9570831  0.6687257  0.7786286       -0.6354393  1.0000000
## peak_rpm          -0.6067026 -0.8859312 -0.5945476        0.8294411 -0.5311897
## highway_mpg       -0.8558041 -0.6846533 -0.6095401        0.5997484 -0.9046454
## price              0.9009026  0.5301386  0.8157368       -0.4708300  0.9336331
##                    peak_rpm highway_mpg      price
## wheel_base       -0.6736100  -0.9009540  0.7759919
## length           -0.5943107  -0.8848982  0.7635967
## width            -0.5050842  -0.8866311  0.9220786
## height           -0.4784417  -0.6023130  0.4128322
## curb_weight      -0.6539481  -0.9232523  0.8785562
## engine_size      -0.6067026  -0.8558041  0.9009026
## bore             -0.8859312  -0.6846533  0.5301386
## stroke           -0.5945476  -0.6095401  0.8157368
## compressio_ratio  0.8294411   0.5997484 -0.4708300
## horsepower       -0.5311897  -0.9046454  0.9336331
## peak_rpm          1.0000000   0.5940493 -0.4069388
## highway_mpg       0.5940493   1.0000000 -0.8058906
## price            -0.4069388  -0.8058906  1.0000000
(R_gas =  cor(data_gas))
##                   wheel_base     length      width      height curb_weight
## wheel_base        1.00000000  0.8614599  0.7678256  0.54098717   0.7365575
## length            0.86145992  1.0000000  0.8255335  0.43963800   0.8659614
## width             0.76782563  0.8255335  1.0000000  0.19958701   0.8475670
## height            0.54098717  0.4396380  0.1995870  1.00000000   0.2154348
## curb_weight       0.73655751  0.8659614  0.8475670  0.21543475   1.0000000
## engine_size       0.55684878  0.6767871  0.7304292  0.02311942   0.8566797
## bore              0.46510306  0.5900065  0.5301144  0.19530484   0.6203486
## stroke            0.08886395  0.0767642  0.1111730 -0.07835286   0.1051811
## compressio_ratio -0.26163899 -0.2549416 -0.2483827 -0.07423001  -0.3122419
## horsepower        0.40174753  0.5855212  0.6852561 -0.09143311   0.8084057
## peak_rpm         -0.21633313 -0.2031141 -0.1111586 -0.19689019  -0.1573616
## highway_mpg      -0.60376224 -0.7682782 -0.7391737 -0.13540946  -0.8690850
## price             0.55854846  0.6756654  0.6913714  0.10022949   0.8085564
##                  engine_size        bore      stroke compressio_ratio
## wheel_base        0.55684878  0.46510306  0.08886395      -0.26163899
## length            0.67678712  0.59000649  0.07676420      -0.25494163
## width             0.73042915  0.53011437  0.11117305      -0.24838270
## height            0.02311942  0.19530484 -0.07835286      -0.07423001
## curb_weight       0.85667970  0.62034857  0.10518111      -0.31224194
## engine_size       1.00000000  0.59566464  0.22144486      -0.17922186
## bore              0.59566464  1.00000000 -0.02161393      -0.21885263
## stroke            0.22144486 -0.02161393  1.00000000      -0.31859638
## compressio_ratio -0.17922186 -0.21885263 -0.31859638       1.00000000
## horsepower        0.81933934  0.55077438  0.08134990      -0.21566458
## peak_rpm         -0.22251268 -0.29687842 -0.05609431       0.18085651
## highway_mpg      -0.70038712 -0.53010291 -0.02723847       0.41998329
## price             0.85360103  0.50613592  0.03192284      -0.18669863
##                   horsepower      peak_rpm   highway_mpg       price
## wheel_base        0.40174753 -0.2163331291 -0.6037622350  0.55854846
## length            0.58552122 -0.2031140742 -0.7682782445  0.67566537
## width             0.68525610 -0.1111585594 -0.7391737241  0.69137143
## height           -0.09143311 -0.1968901893 -0.1354094592  0.10022949
## curb_weight       0.80840572 -0.1573616348 -0.8690849912  0.80855642
## engine_size       0.81933934 -0.2225126800 -0.7003871228  0.85360103
## bore              0.55077438 -0.2968784243 -0.5301029099  0.50613592
## stroke            0.08134990 -0.0560943091 -0.0272384746  0.03192284
## compressio_ratio -0.21566458  0.1808565064  0.4199832950 -0.18669863
## horsepower        1.00000000  0.1136194449 -0.7609468761  0.76115038
## peak_rpm          0.11361944  1.0000000000  0.0003181053 -0.03720969
## highway_mpg      -0.76094688  0.0003181053  1.0000000000 -0.72663399
## price             0.76115038 -0.0372096885 -0.7266339923  1.00000000

In other to perform a better analysis, the fuel type will be taken into consideration.

ggpairs(data, 
        aes(color = `fuel_type`),
        columns = c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price"), 
        upper = list(continuous = wrap("cor", size = 1.5)),
        lower = list(continuous = wrap("points", size = 0.5)))  + 
  theme(strip.text.x = element_text(size = 4),
        strip.text.y = element_text(size = 2),
        axis.text.x = element_text(size = 2),
        axis.text.y = element_text(size = 2))

In this case, the plot shows that the compression ratio will distinguish the cars' fuel type almost perfectly as we can see that those vehicles with a smaller compression ratio will be powered by gas while the ones with a higher compression ratio will be fueled by diesel.

In addition, the wheel_base, the width, the height, the curb_weight and the highway_mpg will help in order to difference the car's fuel as the larger cars are usually powered by diesel.

The peak revolution per minute (peak_rpm) is also a good indicator as the diesel cars usually cannot reach the higher revolutions.

2.1.2 Multicategorical variables.

2.1.3 Symboling

The symboling indicates the security of the car (being \(-2\) the value of the safest cars and \(3\) the most dangerous cars), hence we considered that it will also be a good variable to analyze.

# Mean vector
m_symb = aggregate(x = data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")],               
          by = list(data$symboling),              
          FUN = mean)

m_symb

In this case, we can see that the safest cars have the largest means when it comes to the size, but the cars with the biggest mean in the horsepower variable are the most dangerous. Moreover, the mean of the price of the safest cars is not the largest as it is the one from the most dangerous cars.

Before analyzing the variable, we will be gathering the cars in three groups as there are groups with very few instances:

  • Risky: symboling = 3 and 2.
  • Neutral: symboling = 0 and 1.
  • Safe: symboling = -1 and -2
# Covariance matrix
data_num = data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")]

data_r = data_num[(data$symboling==3) | (data$symboling==2),]
data_n = data_num[(data$symboling==0) | (data$symboling==1),]
data_s = data_num[(data$symboling==-1) | (data$symboling==-2),]

(S_r = cov(data_r))
##                     wheel_base        length        width        height
## wheel_base        1.412119e+01    28.9635447    2.7504296     4.0731911
## length            2.896354e+01   102.0819345   10.7541818     6.4423700
## width             2.750430e+00    10.7541818    2.1563238    -0.2477031
## height            4.073191e+00     6.4423700   -0.2477031     5.8428755
## curb_weight       5.081797e+02  2879.3997954  472.7517534  -137.7914085
## engine_size       6.724956e+00   152.9789012   31.4948860   -18.6793688
## bore              5.196786e-02     0.9502057    0.1043992    -0.0278647
## stroke            3.242247e-01     0.4026204    0.1264705    -0.1042119
## compressio_ratio  1.040162e+00    -0.3035552   -0.3027162     2.8473025
## horsepower        1.558913e+00   182.4067504   32.0248977   -25.6024839
## peak_rpm         -2.186470e+02  -249.0385739  -14.6201052  -139.6902396
## highway_mpg      -7.706458e+00   -44.0803916   -6.7057569     4.9316189
## price            -2.489644e+03 25546.0286967 4974.5707773 -2589.1207773
##                    curb_weight   engine_size         bore        stroke
## wheel_base           508.17972      6.724956   0.05196786  3.242247e-01
## length              2879.39980    152.978901   0.95020573  4.026204e-01
## width                472.75175     31.494886   0.10439918  1.264705e-01
## height              -137.79141    -18.679369  -0.02786470 -1.042119e-01
## curb_weight       151385.57802  11244.914670  61.91029807  1.959516e+01
## engine_size        11244.91467   1252.139684   7.33921683  3.651873e+00
## bore                  61.91030      7.339217   0.09677165  2.732481e-02
## stroke                19.59516      3.651873   0.02732481  1.547635e-01
## compressio_ratio    -264.60721    -29.159591  -0.28705228 -8.323846e-03
## horsepower         12280.42519   1126.902104   6.29861192  8.676505e-02
## peak_rpm          -11294.30158  -2531.706604 -43.41817650 -5.781575e+01
## highway_mpg        -2251.57773   -145.517534  -0.87630625  2.467855e-01
## price            1925998.35476 185967.537113 910.53158387 -4.924068e+02
##                  compressio_ratio    horsepower      peak_rpm   highway_mpg
## wheel_base           1.040162e+00  1.558913e+00    -218.64699 -7.706458e+00
## length              -3.035552e-01  1.824068e+02    -249.03857 -4.408039e+01
## width               -3.027162e-01  3.202490e+01     -14.62011 -6.705757e+00
## height               2.847302e+00 -2.560248e+01    -139.69024  4.931619e+00
## curb_weight         -2.646072e+02  1.228043e+04  -11294.30158 -2.251578e+03
## engine_size         -2.915959e+01  1.126902e+03   -2531.70660 -1.455175e+02
## bore                -2.870523e-01  6.298612e+00     -43.41818 -8.763063e-01
## stroke              -8.323846e-03  8.676505e-02     -57.81575  2.467855e-01
## compressio_ratio     1.030311e+01 -4.590623e+01    -298.99591  1.232608e+01
## horsepower          -4.590623e+01  1.620538e+03    5602.46932 -2.151762e+02
## peak_rpm            -2.989959e+02  5.602469e+03  192441.55465 -8.846581e+02
## highway_mpg          1.232608e+01 -2.151762e+02    -884.65809  5.389947e+01
## price               -3.663823e+03  2.358503e+05 1011449.85389 -3.092265e+04
##                          price
## wheel_base          -2489.6436
## length              25546.0287
## width                4974.5708
## height              -2589.1208
## curb_weight       1925998.3548
## engine_size        185967.5371
## bore                  910.5316
## stroke               -492.4068
## compressio_ratio    -3663.8232
## horsepower         235850.2583
## peak_rpm          1011449.8539
## highway_mpg        -30922.6470
## price            49733363.7884
(S_n = cov(data_n))
##                     wheel_base        length         width        height
## wheel_base          35.8007328    68.5658099    11.4907741    7.02717355
## length              68.5658099   162.4269215    25.0266494   14.25629270
## width               11.4907741    25.0266494     5.3023829    1.56865702
## height               7.0271736    14.2562927     1.5686570    4.74166391
## curb_weight       2904.4136433  6416.9902548  1107.3962328  451.01820937
## engine_size        190.5608884   434.8481749    79.9675895   10.59264463
## bore                 1.0757191     2.4300220     0.4018499    0.17898567
## stroke               0.2922473     0.6374983     0.1225238    0.01332107
## compressio_ratio     4.9689562     7.0893218     1.1866580    1.78077252
## horsepower         125.0676860   306.9158196    66.7429063   -5.12024105
## peak_rpm         -1130.2396694 -2393.5192837  -273.2227961 -404.54579890
## highway_mpg        -28.0149587   -66.3832438   -11.3938912   -4.09968320
## price            37270.5190083 80896.4136846 14023.4792218 4041.52739669
##                    curb_weight   engine_size          bore        stroke
## wheel_base          2904.41364    190.560888  1.075719e+00   0.292247314
## length              6416.99025    434.848175  2.430022e+00   0.637498347
## width               1107.39623     79.967590  4.018499e-01   0.122523760
## height               451.01821     10.592645  1.789857e-01   0.013321074
## curb_weight       310501.95275  21949.710675  1.032828e+02  34.410865014
## engine_size        21949.71067   2092.027824  7.531585e+00   3.646768595
## bore                 103.28277      7.531585  6.863551e-02  -0.008493747
## stroke                34.41087      3.646769 -8.493747e-03   0.093927948
## compressio_ratio     330.91031      8.709130  1.622938e-01   0.278833134
## horsepower         17368.82665   1554.848072  5.911997e+00   1.396302342
## peak_rpm          -91600.60262  -6300.199725 -5.718815e+01   2.894111570
## highway_mpg        -3132.12741   -225.162466 -1.057603e+00  -0.136590220
## price            3793972.91618 334286.632507  1.205517e+03 557.756831956
##                  compressio_ratio    horsepower      peak_rpm   highway_mpg
## wheel_base              4.9689562    125.067686 -1.130240e+03 -2.801496e+01
## length                  7.0893218    306.915820 -2.393519e+03 -6.638324e+01
## width                   1.1866580     66.742906 -2.732228e+02 -1.139389e+01
## height                  1.7807725     -5.120241 -4.045458e+02 -4.099683e+00
## curb_weight           330.9103051  17368.826653 -9.160060e+04 -3.132127e+03
## engine_size             8.7091302   1554.848072 -6.300200e+03 -2.251625e+02
## bore                    0.1622938      5.911997 -5.718815e+01 -1.057603e+00
## stroke                  0.2788331      1.396302  2.894112e+00 -1.365902e-01
## compressio_ratio       15.4488844    -27.199467 -9.223371e+02  7.127618e+00
## horsepower            -27.1994669   1647.697658  1.403805e+03 -2.126931e+02
## peak_rpm             -922.3370523   1403.805096  2.487855e+05  1.621419e+02
## highway_mpg             7.1276185   -212.693113  1.621419e+02  4.735523e+01
## price                1257.3287404 238751.093044 -9.798452e+05 -4.056428e+04
##                          price
## wheel_base          37270.5190
## length              80896.4137
## width               14023.4792
## height               4041.5274
## curb_weight       3793972.9162
## engine_size        334286.6325
## bore                 1205.5165
## stroke                557.7568
## compressio_ratio     1257.3287
## horsepower         238751.0930
## peak_rpm          -979845.1550
## highway_mpg        -40564.2756
## price            67651109.4315
(S_s = cov(data_s))
##                     wheel_base        length         width        height
## wheel_base          19.1317333    31.1023333  7.713533e+00    2.44600000
## length              31.1023333    69.6958333  1.358700e+01    3.33083333
## width                7.7135333    13.5870000  3.689767e+00    0.25383333
## height               2.4460000     3.3308333  2.538333e-01    3.49250000
## curb_weight       1602.4173333  3395.0525000  7.078562e+02  245.48083333
## engine_size        104.5795000   199.8533333  4.473017e+01    8.99666667
## bore                 0.3467750     0.9960833  1.600333e-01    0.15883333
## stroke              -0.1217033    -0.6197833 -3.381833e-02   -0.02125833
## compressio_ratio    10.0421667    11.9750000  5.140667e+00    1.16833333
## horsepower          43.5953333   144.3200000  1.993600e+01   -2.82833333
## peak_rpm          -206.2166667   802.4166667 -1.449000e+02   43.00000000
## highway_mpg        -11.2403333   -27.0200000 -4.641833e+00   -1.36333333
## price            28782.2700000 53710.8333333  1.304846e+04 2580.45000000
##                    curb_weight   engine_size         bore        stroke
## wheel_base          1602.41733  1.045795e+02   0.34677500   -0.12170333
## length              3395.05250  1.998533e+02   0.99608333   -0.61978333
## width                707.85617  4.473017e+01   0.16003333   -0.03381833
## height               245.48083  8.996667e+00   0.15883333   -0.02125833
## curb_weight       190111.75667  1.091884e+04  37.24691667  -11.20995000
## engine_size        10918.83667  8.397100e+02   1.28908333   -0.71976667
## bore                  37.24692  1.289083e+00   0.05435833   -0.02667417
## stroke               -11.20995 -7.197667e-01  -0.02667417    0.04489933
## compressio_ratio    1043.53417  4.222333e+01  -0.23037500    0.52755833
## horsepower          7194.87000  4.135183e+02   1.88241667   -2.51660000
## peak_rpm           13485.75000 -5.685833e+02  51.28333333  -80.14333333
## highway_mpg        -1410.61167 -8.670167e+01  -0.31033333    0.39426667
## price            2939786.07500  1.810252e+05 496.28000000 -175.54641667
##                  compressio_ratio    horsepower      peak_rpm   highway_mpg
## wheel_base             10.0421667     43.595333    -206.21667 -1.124033e+01
## length                 11.9750000    144.320000     802.41667 -2.702000e+01
## width                   5.1406667     19.936000    -144.90000 -4.641833e+00
## height                  1.1683333     -2.828333      43.00000 -1.363333e+00
## curb_weight          1043.5341667   7194.870000   13485.75000 -1.410612e+03
## engine_size            42.2233333    413.518333    -568.58333 -8.670167e+01
## bore                   -0.2303750      1.882417      51.28333 -3.103333e-01
## stroke                  0.5275583     -2.516600     -80.14333  3.942667e-01
## compressio_ratio       29.6000000    -31.651667   -1183.08333  3.183333e-01
## horsepower            -31.6516667    706.993333    5210.16667 -9.259333e+01
## peak_rpm            -1183.0833333   5210.166667  237516.66667 -6.805833e+02
## highway_mpg             0.3183333    -92.593333    -680.58333  1.569333e+01
## price               17480.6125000 101885.350000 -227227.50000 -2.232431e+04
##                          price
## wheel_base          28782.2700
## length              53710.8333
## width               13048.4558
## height               2580.4500
## curb_weight       2939786.0750
## engine_size        181025.2250
## bore                  496.2800
## stroke               -175.5464
## compressio_ratio    17480.6125
## horsepower         101885.3500
## peak_rpm          -227227.5000
## highway_mpg        -22324.3083
## price            51973072.7500
# Correlation matrix

(R_r = cor(data_r))
##                   wheel_base      length       width      height curb_weight
## wheel_base        1.00000000  0.76285443  0.49843433  0.44842093   0.3475678
## length            0.76285443  1.00000000  0.72484632  0.26378974   0.7324623
## width             0.49843433  0.72484632  1.00000000 -0.06978485   0.8274350
## height            0.44842093  0.26378974 -0.06978485  1.00000000  -0.1465098
## curb_weight       0.34756776  0.73246229  0.82743504 -0.14650976   1.0000000
## engine_size       0.05057402  0.42788859  0.60611728 -0.21838484   0.8167470
## bore              0.04445546  0.30232156  0.22854200 -0.03705672   0.5115011
## stroke            0.21931898  0.10129477  0.21892627 -0.10958965   0.1280184
## compressio_ratio  0.08623455 -0.00936007 -0.06422359  0.36697469  -0.2118728
## horsepower        0.01030521  0.44847361  0.54175284 -0.26311113   0.7840461
## peak_rpm         -0.13263510 -0.05618789 -0.02269573 -0.13173565  -0.0661710
## highway_mpg      -0.27933604 -0.59426306 -0.62201170  0.27789705  -0.7882290
## price            -0.09394587  0.35852961  0.48036871 -0.15188507   0.7019234
##                  engine_size        bore       stroke compressio_ratio
## wheel_base        0.05057402  0.04445546  0.219318981      0.086234548
## length            0.42788859  0.30232156  0.101294768     -0.009360070
## width             0.60611728  0.22854200  0.218926273     -0.064223588
## height           -0.21838484 -0.03705672 -0.109589649      0.366974687
## curb_weight       0.81674695  0.51150109  0.128018362     -0.211872818
## engine_size       1.00000000  0.66672885  0.262334243     -0.256726594
## bore              0.66672885  1.00000000  0.223279451     -0.287476760
## stroke            0.26233424  0.22327945  1.000000000     -0.006591823
## compressio_ratio -0.25672659 -0.28747676 -0.006591823      1.000000000
## horsepower        0.79109790  0.50296898  0.005478743     -0.355269490
## peak_rpm         -0.16309384 -0.31816183 -0.335013594     -0.212339916
## highway_mpg      -0.56014011 -0.38369803  0.085446263      0.523056478
## price             0.74522400  0.41504675 -0.177486684     -0.161855109
##                    horsepower    peak_rpm highway_mpg       price
## wheel_base        0.010305206 -0.13263510 -0.27933604 -0.09394587
## length            0.448473609 -0.05618789 -0.59426306  0.35852961
## width             0.541752838 -0.02269573 -0.62201170  0.48036871
## height           -0.263111128 -0.13173565  0.27789705 -0.15188507
## curb_weight       0.784046071 -0.06617100 -0.78822904  0.70192344
## engine_size       0.791097898 -0.16309384 -0.56014011  0.74522400
## bore              0.502968978 -0.31816183 -0.38369803  0.41504675
## stroke            0.005478743 -0.33501359  0.08544626 -0.17748668
## compressio_ratio -0.355269490 -0.21233992  0.52305648 -0.16185511
## horsepower        1.000000000  0.31724917 -0.72806881  0.83077395
## peak_rpm          0.317249166  1.00000000 -0.27468425  0.32694222
## highway_mpg      -0.728068810 -0.27468425  1.00000000 -0.59725595
## price             0.830773950  0.32694222 -0.59725595  1.00000000
(R_n = cor(data_n))
##                  wheel_base     length      width      height curb_weight
## wheel_base        1.0000000  0.8991510  0.8340033  0.53934833   0.8711247
## length            0.8991510  1.0000000  0.8527823  0.51370313   0.9035875
## width             0.8340033  0.8527823  1.0000000  0.31284345   0.8630488
## height            0.5393483  0.5137031  0.3128435  1.00000000   0.3717034
## curb_weight       0.8711247  0.9035875  0.8630488  0.37170339   1.0000000
## engine_size       0.6963123  0.7459761  0.7592674  0.10635436   0.8612176
## bore              0.6862439  0.7277912  0.6661221  0.31374619   0.7074916
## stroke            0.1593702  0.1632122  0.1736150  0.01996073   0.2014958
## compressio_ratio  0.2112860  0.1415230  0.1311117  0.20806283   0.1510879
## horsepower        0.5149445  0.5932688  0.7140536 -0.05792770   0.7678910
## peak_rpm         -0.3787146 -0.3765264 -0.2378860 -0.37246872  -0.3295745
## highway_mpg      -0.6803933 -0.7569121 -0.7190391 -0.27359044  -0.8168142
## price             0.7573245  0.7717254  0.7404279  0.22565382   0.8277984
##                  engine_size       bore      stroke compressio_ratio
## wheel_base        0.69631233  0.6862439  0.15937021       0.21128596
## length            0.74597611  0.7277912  0.16321221       0.14152303
## width             0.75926736  0.6661221  0.17361504       0.13111170
## height            0.10635436  0.3137462  0.01996073       0.20806283
## curb_weight       0.86121761  0.7074916  0.20149580       0.15108786
## engine_size       1.00000000  0.6285332  0.26015194       0.04844428
## bore              0.62853323  1.0000000 -0.10578582       0.15760822
## stroke            0.26015194 -0.1057858  1.00000000       0.23147197
## compressio_ratio  0.04844428  0.1576082  0.23147197       1.00000000
## horsepower        0.83746269  0.5559311  0.11223887      -0.17047986
## peak_rpm         -0.27615829 -0.4376422  0.01893239      -0.47046637
## highway_mpg      -0.71536596 -0.5866298 -0.06476468       0.26351897
## price             0.88858263  0.5594500  0.22126362       0.03889223
##                   horsepower    peak_rpm highway_mpg       price
## wheel_base        0.51494446 -0.37871463 -0.68039331  0.75732452
## length            0.59326879 -0.37652639 -0.75691214  0.77172541
## width             0.71405359 -0.23788600 -0.71903914  0.74042789
## height           -0.05792770 -0.37246872 -0.27359044  0.22565382
## curb_weight       0.76789095 -0.32957454 -0.81681416  0.82779843
## engine_size       0.83746269 -0.27615829 -0.71536596  0.88858263
## bore              0.55593111 -0.43764215 -0.58662976  0.55945001
## stroke            0.11223887  0.01893239 -0.06476468  0.22126362
## compressio_ratio -0.17047986 -0.47046637  0.26351897  0.03889223
## horsepower        1.00000000  0.06933548 -0.76143120  0.71510390
## peak_rpm          0.06933548  1.00000000  0.04723878 -0.23884023
## highway_mpg      -0.76143120  0.04723878  1.00000000 -0.71667502
## price             0.71510390 -0.23884023 -0.71667502  1.00000000
(R_s = cor(data_s))
##                   wheel_base     length       width      height curb_weight
## wheel_base        1.00000000  0.8517505  0.91807202  0.29923410  0.84022188
## length            0.85175051  1.0000000  0.84726783  0.21349184  0.93269275
## width             0.91807202  0.8472678  1.00000000  0.07071003  0.84516418
## height            0.29923410  0.2134918  0.07071003  1.00000000  0.30126237
## curb_weight       0.84022188  0.9326927  0.84516418  0.30126237  1.00000000
## engine_size       0.82509649  0.8261204  0.80359348  0.16613027  0.86418599
## bore              0.34004591  0.5117516  0.35733696  0.36453596  0.36639773
## stroke           -0.13131229 -0.3503616 -0.08308695 -0.05368353 -0.12133309
## compressio_ratio  0.42199219  0.2636491  0.49189696  0.11490865  0.43990240
## horsepower        0.37484799  0.6501528  0.39032909 -0.05691867  0.62059862
## peak_rpm         -0.09673855  0.1972193 -0.15478251  0.04721209  0.06346343
## highway_mpg      -0.64870080 -0.8170043 -0.61000373 -0.18415192 -0.81666771
## price             0.91276439  0.8924200  0.94225938  0.19153054  0.93523778
##                  engine_size       bore      stroke compressio_ratio
## wheel_base         0.8250965  0.3400459 -0.13131229       0.42199219
## length             0.8261204  0.5117516 -0.35036163       0.26364908
## width              0.8035935  0.3573370 -0.08308695       0.49189696
## height             0.1661303  0.3645360 -0.05368353       0.11490865
## curb_weight        0.8641860  0.3663977 -0.12133309       0.43990240
## engine_size        1.0000000  0.1908022 -0.11722146       0.26781938
## bore               0.1908022  1.0000000 -0.53993048      -0.18161697
## stroke            -0.1172215 -0.5399305  1.00000000       0.45761967
## compressio_ratio   0.2678194 -0.1816170  0.45761967       1.00000000
## horsepower         0.5366887  0.3036512 -0.44666980      -0.21879786
## peak_rpm          -0.0402608  0.4513320 -0.77606913      -0.44619267
## highway_mpg       -0.7552751 -0.3359988  0.46969130       0.01476994
## price              0.8665333  0.2952599 -0.11491662       0.44567841
##                   horsepower    peak_rpm highway_mpg       price
## wheel_base        0.37484799 -0.09673855 -0.64870080  0.91276439
## length            0.65015282  0.19721933 -0.81700429  0.89241998
## width             0.39032909 -0.15478251 -0.61000373  0.94225938
## height           -0.05691867  0.04721209 -0.18415192  0.19153054
## curb_weight       0.62059862  0.06346343 -0.81666771  0.93523778
## engine_size       0.53668869 -0.04026080 -0.75527513  0.86653326
## bore              0.30365122  0.45133201 -0.33599883  0.29525993
## stroke           -0.44666980 -0.77606913  0.46969130 -0.11491662
## compressio_ratio -0.21879786 -0.44619267  0.01476994  0.44567841
## horsepower        1.00000000  0.40206560 -0.87905178  0.53151422
## peak_rpm          0.40206560  1.00000000 -0.35251424 -0.06467324
## highway_mpg      -0.87905178 -0.35251424  1.00000000 -0.78168399
## price             0.53151422 -0.06467324 -0.78168399  1.00000000

We plot the results in order to analyze them better.

data$risk <- ifelse(data$symboling==3, 1, 
                    ifelse(data$symboling==2, 1, 
                           ifelse(data$symboling==1, 0, 
                                  ifelse(data$symboling==0, 0, 
                                         ifelse(data$symboling==-1, -1, 
                                                ifelse(data$symboling==-2, -1, 0))))))
data$risk <- as.factor(data$risk)
  
ggpairs(data, 
        aes(color = risk),
        columns = c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price"), 
        upper = list(continuous = wrap("cor", size = 1.5)),
        lower = list(continuous = wrap("points", size = 0.5)))  + 
  theme(strip.text.x = element_text(size = 4),
        strip.text.y = element_text(size = 2),
        axis.text.x = element_text(size = 2),
        axis.text.y = element_text(size = 2))

In the plot, it can be seen that the least secure cars are usually the smallest but they are not the cheapest. However, the safest cars are not the largest either but they usually are in the middle when in comes to the curb_weight and the ones with the largest bores without a very high nor very low peak of the revolutions per minute.

For the most part, the risk of the cars cannot be predicted easily with the plot.

3 Outlier's analysis.

The values that are very different from the rest of the data set are considered outliers.

It is important to study these observations and have them identifed as they can cause a lot of problems in various studies, such as the dimension reduction, due to their influence on the mean vector, the covariance and the correlation matrixes.

In order to determine the existence of outliers in the data set, we are going to estimate the mean vector and the covariance and correlation matrixes with the minimum covariance determinant (MCD) which is based on the Mahalanobis distance:

\[ D_M (x, \mu) = \sqrt{(x- \mu)' \Sigma^{-1} (x- \mu)'}\]

As we have two well-differentiated groups by the fuel_type in our data, we will be searching the outliers in each group.

3.1 Gas

MCD_est = CovMcd(data_gas,alpha=0.85,nsamp="deterministic")

# Robust mean vector
m_MCD = MCD_est$center
m_MCD
##       wheel_base           length            width           height 
##        97.371053       171.132237        65.276974        53.428947 
##      curb_weight      engine_size             bore           stroke 
##      2391.296053       114.355263         3.289079         3.226579 
## compressio_ratio       horsepower         peak_rpm      highway_mpg 
##         8.882368        96.388158      5214.802632        31.427632 
##            price 
##     10435.282895
# Robust covariance matrix
S_MCD = MCD_est$cov
S_MCD
##                     wheel_base        length         width        height
## wheel_base          20.4371612    41.7490398    6.08824230    6.44120179
## length              41.7490398   125.6007817   15.20309419   12.56314189
## width                6.0882423    15.2030942    2.69775184    1.11452216
## height               6.4412018    12.5631419    1.11452216    6.49132532
## curb_weight       1433.1743038  4087.1628149  562.37705338  292.85149197
## engine_size         71.6425389   211.7800854   29.33238001   10.27578643
## bore                 0.7409726     2.1719858    0.26498173    0.22792689
## stroke               0.1715724     0.3798969    0.08985876   -0.04554487
## compressio_ratio    -0.4527135    -1.3745719   -0.27644553    0.08943117
## horsepower          78.5574931   238.4507530   34.89745904    3.76964385
## peak_rpm          -282.8869775  -853.9645042  -81.30783436 -195.60660522
## highway_mpg        -14.9533718   -47.2197036   -6.71061681   -2.04151058
## price            13686.4232673 37712.3770323 5513.22762220 2256.79250451
##                    curb_weight  engine_size          bore       stroke
## wheel_base          1433.17430    71.642539   0.740972613  0.171572449
## length              4087.16281   211.780085   2.171985760  0.379896889
## width                562.37705    29.332380   0.264981732  0.089858761
## height               292.85149    10.275786   0.227926891 -0.045544872
## curb_weight       174925.57094  9408.243172  82.489153156 18.909528354
## engine_size         9408.24317   701.130429   5.171584698  3.256370332
## bore                  82.48915     5.171585   0.083140528 -0.007573647
## stroke                18.90953     3.256370  -0.007573647  0.105715169
## compressio_ratio    -102.62361    -5.832309  -0.056810960 -0.073680239
## horsepower         11344.79830   646.290662   4.788890677  1.670302151
## peak_rpm          -20457.90968 -3401.691299 -59.089098384  1.661626115
## highway_mpg        -2161.36730  -104.083691  -0.900746622 -0.079903321
## price            1581059.27487 76573.003428 694.548032619 49.767389563
##                  compressio_ratio    horsepower      peak_rpm   highway_mpg
## wheel_base            -0.45271348     78.557493   -282.886978 -1.495337e+01
## length                -1.37457186    238.450753   -853.964504 -4.721970e+01
## width                 -0.27644553     34.897459    -81.307834 -6.710617e+00
## height                 0.08943117      3.769644   -195.606605 -2.041511e+00
## curb_weight         -102.62360575  11344.798302 -20457.909675 -2.161367e+03
## engine_size           -5.83230872    646.290662  -3401.691299 -1.040837e+02
## bore                  -0.05681096      4.788891    -59.089098 -9.007466e-01
## stroke                -0.07368024      1.670302      1.661626 -7.990332e-02
## compressio_ratio       0.52194240     -9.358664     63.667176  1.989785e+00
## horsepower            -9.35866355    999.421160   1499.055711 -1.568303e+02
## peak_rpm              63.66717571   1499.055711 227560.012260 -1.517017e+02
## highway_mpg            1.98978490   -156.830294   -151.701675  3.700268e+01
## price               -806.22226055 111652.275889 178774.782682 -2.085700e+04
##                          price
## wheel_base        1.368642e+04
## length            3.771238e+04
## width             5.513228e+03
## height            2.256793e+03
## curb_weight       1.581059e+06
## engine_size       7.657300e+04
## bore              6.945480e+02
## stroke            4.976739e+01
## compressio_ratio -8.062223e+02
## horsepower        1.116523e+05
## peak_rpm          1.787748e+05
## highway_mpg      -2.085700e+04
## price             1.874260e+07
# Robust correlation matrix
R_MCD = cov2cor(S_MCD)
R_MCD
##                  wheel_base     length      width      height curb_weight
## wheel_base        1.0000000  0.8240243  0.8199373  0.55922932   0.7579876
## length            0.8240243  1.0000000  0.8259140  0.43998250   0.8719651
## width             0.8199373  0.8259140  1.0000000  0.26633057   0.8186532
## height            0.5592293  0.4399825  0.2663306  1.00000000   0.2748237
## curb_weight       0.7579876  0.8719651  0.8186532  0.27482365   1.0000000
## engine_size       0.5984960  0.7136570  0.6744456  0.15231721   0.8495373
## bore              0.5684411  0.6721318  0.5595109  0.31025764   0.6840121
## stroke            0.1167263  0.1042560  0.1682638 -0.05497992   0.1390546
## compressio_ratio -0.1386123 -0.1697696 -0.2329685  0.04858598  -0.3396326
## horsepower        0.5496713  0.6730204  0.6720762  0.04680146   0.8580169
## peak_rpm         -0.1311761 -0.1597334 -0.1037728 -0.16094180  -0.1025384
## highway_mpg      -0.5437662 -0.6926446 -0.6716528 -0.13172504  -0.8495429
## price             0.6993020  0.7772716  0.7753361  0.20460207   0.8731857
##                  engine_size        bore      stroke compressio_ratio
## wheel_base         0.5984960  0.56844111  0.11672627      -0.13861233
## length             0.7136570  0.67213175  0.10425596      -0.16976963
## width              0.6744456  0.55951089  0.16826383      -0.23296855
## height             0.1523172  0.31025764 -0.05497992       0.04858598
## curb_weight        0.8495373  0.68401213  0.13905461      -0.33963257
## engine_size        1.0000000  0.67735734  0.37823849      -0.30488061
## bore               0.6773573  1.00000000 -0.08078486      -0.27271841
## stroke             0.3782385 -0.08078486  1.00000000      -0.31366863
## compressio_ratio  -0.3048806 -0.27271841 -0.31366863       1.00000000
## horsepower         0.7720656  0.52535668  0.16249953      -0.40975879
## peak_rpm          -0.2693069 -0.42958885  0.01071314       0.18473804
## highway_mpg       -0.6461996 -0.51354635 -0.04039983       0.45277063
## price              0.6679768  0.55639214  0.03535585      -0.25776775
##                   horsepower    peak_rpm highway_mpg       price
## wheel_base        0.54967130 -0.13117609 -0.54376625  0.69930200
## length            0.67302044 -0.15973338 -0.69264465  0.77727165
## width             0.67207623 -0.10377277 -0.67165282  0.77533611
## height            0.04680146 -0.16094180 -0.13172504  0.20460207
## curb_weight       0.85801693 -0.10253842 -0.84954292  0.87318565
## engine_size       0.77206560 -0.26930692 -0.64619959  0.66797678
## bore              0.52535668 -0.42958885 -0.51354635  0.55639214
## stroke            0.16249953  0.01071314 -0.04039983  0.03535585
## compressio_ratio -0.40975879  0.18473804  0.45277063 -0.25776775
## horsepower        1.00000000  0.09940209 -0.81552839  0.81579004
## peak_rpm          0.09940209  1.00000000 -0.05227882  0.08656521
## highway_mpg      -0.81552839 -0.05227882  1.00000000 -0.79199127
## price             0.81579004  0.08656521 -0.79199127  1.00000000

In the graph shown below, we can see that the first eigenvalue of the covariance matrix obtained with the data is larger that the one calculated with MCD. The rest are more similar.

# Compare eigenvalues of both covariance matrices
eval_S = eigen(S)$values
eval_S_MCD = eigen(S_MCD)$values

min_y = min(cbind(eval_S,eval_S_MCD)) - 1
max_y = max(cbind(eval_S,eval_S_MCD)) + 1

plot(1:13,eval_S,col="blue",type="b",xlab="Number",ylab="Eigenvalues",pch=19,
     ylim=c(min_y,max_y),main="Comparison of eigenvalues")
points(1:13,eval_S_MCD,col="red",type="b",pch=19)
legend(7,60000000, legend=c("Eigenvalues of S","Eigenvalues of S MCD"),
       col=c("Blue","Red"),lty=1,cex=1.2)

A comparison between the correlation matrixes can also help in order to know the influence of the outliers.

# Compare eigenvalues of both correlation matrices
par(mfrow=c(1,2))
corrplot(R)
corrplot(R_MCD)

In this case, we can see that there is not a big difference in the correlations obtained with the data and through the MCD estimation, but we can see that there are some that have been lowered such as horsepower and wheel_base and bore and width.

A graphical analysis has been carried out in order to detect the outliers. In this graphs we are representing the Mahalanobis distances, and the outliers are graphed in green.

# Show the squared Mahalanobis distances with the final set of non_outliers
n = nrow(data_gas)
p = ncol(data_gas)

X_sq_Mah_MCD = MCD_est$mah
col_outliers_Mah_MCD = rep("black",n)
outliers_Mah_MCD = which(X_sq_Mah_MCD>qchisq(.99^(1/n),p))
outliers_Mah_MCD
##  [1]  10  13  14  15  16  17  18  31  48  49  50  66  67  68  69  98  99 102 104
## [20] 114 115 116 117 118 123 144 184
col_outliers_Mah_MCD[outliers_Mah_MCD] = "green"
par(mfrow=c(1,2))
plot(1:n,X_sq_Mah_MCD,pch=19,col=col_outliers_Mah_MCD,
     main="Squared Mahalanobis distances",xlab="Observation",ylab="Squared Mahalanobis distance")
abline(h=qchisq(.99^(1/n),p),lwd=3,col="blue")
plot(1:n,log(X_sq_Mah_MCD),pch=19,col=col_outliers_Mah_MCD,
     main="Log of squared Mahalanobis distances",
     xlab="Observation",ylab="Log of squared Mahalanobis distance")
abline(h=log(qchisq(.99^(1/n),p)),lwd=3,col="blue")

# Show potential outliers

ggpairs(data_gas, mapping =  aes(color = col_outliers_Mah_MCD),       
        columns = c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price"), 
        upper = list(continuous = wrap("cor", size = 1.5)),
        lower = list(continuous = wrap("points", size = 0.5)))  + 
  theme(strip.text.x = element_text(size = 4),
        strip.text.y = element_text(size = 2),
        axis.text.x = element_text(size = 2),
        axis.text.y = element_text(size = 2))

After representing the outliers in the dispersion matrix, we can see that the cars that are considered outliers are those larger, heavier and especially those that consume the most.

3.2 Diesel

MCD_est = CovMcd(data_diesel,alpha=0.85,nsamp="deterministic")

# Robust mean vector
m_MCD = MCD_est$center
m_MCD
##       wheel_base           length            width           height 
##       104.910526       182.889474        67.621053        55.905263 
##      curb_weight      engine_size             bore           stroke 
##      2945.210526       137.421053         3.394737         3.486316 
## compressio_ratio       horsepower         peak_rpm      highway_mpg 
##        22.010526        86.000000      4415.789474        33.947368 
##            price 
##     16298.105263
# Robust covariance matrix
S_MCD = MCD_est$cov
S_MCD
##                    wheel_base        length         width       height
## wheel_base          86.795121    139.937557    24.4073146   15.9363531
## length             139.937557    241.017526    41.8823564   28.0529800
## width               24.407315     41.882356     9.0150167    4.2441723
## height              15.936353     28.052980     4.2441723    5.3926631
## curb_weight       6937.389803  11428.471180  2253.1987327 1265.8051658
## engine_size        354.285208    568.448552   124.1202378   54.7975104
## bore                 2.641575      4.069196     0.7176557    0.4894598
## stroke               1.077094      1.648904     0.3717788    0.1768813
## compressio_ratio    -8.396076    -12.879525    -2.2227294   -1.6383823
## horsepower         287.145148    473.019575   104.7812944   46.1514521
## peak_rpm         -1887.961621  -2689.999951  -413.5384227 -334.8253887
## highway_mpg        -93.570371   -152.519670   -29.4055778  -15.3759508
## price            76002.744832 124358.217679 29820.5592894 9582.6370055
##                   curb_weight   engine_size          bore        stroke
## wheel_base          6937.3898    354.285208    2.64157489    1.07709362
## length             11428.4712    568.448552    4.06919621    1.64890424
## width               2253.1987    124.120238    0.71765566    0.37177878
## height              1265.8052     54.797510    0.48945981    0.17688128
## curb_weight       622537.5009  33718.919720  230.51568180  100.94752351
## engine_size        33718.9197   2020.015327   13.72864677    6.11883757
## bore                 230.5157     13.728647    0.15215474    0.04245841
## stroke               100.9475      6.118838    0.04245841    0.02472079
## compressio_ratio    -722.3404    -40.473753   -0.43865177   -0.12896850
## horsepower         27252.7605   1547.057917    8.98859679    4.53093521
## peak_rpm         -153515.1671  -8325.450269 -109.68308740  -31.64925906
## highway_mpg        -8139.1464   -435.221201   -2.82653181   -1.16512558
## price            7405479.7778 433618.172372 2052.23483068 1419.86335211
##                  compressio_ratio    horsepower      peak_rpm   highway_mpg
## wheel_base             -8.3960762    287.145148 -1.887962e+03    -93.570371
## length                -12.8795245    473.019575 -2.690000e+03   -152.519670
## width                  -2.2227294    104.781294 -4.135384e+02    -29.405578
## height                 -1.6383823     46.151452 -3.348254e+02    -15.375951
## curb_weight          -722.3404477  27252.760532 -1.535152e+05  -8139.146447
## engine_size           -40.4737528   1547.057917 -8.325450e+03   -435.221201
## bore                   -0.4386518      8.988597 -1.096831e+02     -2.826532
## stroke                 -0.1289685      4.530935 -3.164926e+01     -1.165126
## compressio_ratio        1.3383806    -27.876789  3.359478e+02      8.815860
## horsepower            -27.8767894   1300.333566 -5.626759e+03   -369.211616
## peak_rpm              335.9478049  -5626.758788  1.039242e+05   1910.122146
## highway_mpg             8.8158598   -369.211616  1.910122e+03    127.840328
## price               -6201.3685412 361153.376098 -1.204764e+06 -96747.321187
##                          price
## wheel_base           76002.745
## length              124358.218
## width                29820.559
## height                9582.637
## curb_weight        7405479.778
## engine_size         433618.172
## bore                  2052.235
## stroke                1419.863
## compressio_ratio     -6201.369
## horsepower          361153.376
## peak_rpm          -1204764.143
## highway_mpg         -96747.321
## price            116329651.227
# Robust correlation matrix
R_MCD = cov2cor(S_MCD)
R_MCD
##                  wheel_base     length      width     height curb_weight
## wheel_base        1.0000000  0.9675253  0.8725475  0.7366134   0.9437689
## length            0.9675253  1.0000000  0.8985113  0.7781316   0.9329988
## width             0.8725475  0.8985113  1.0000000  0.6087067   0.9511160
## height            0.7366134  0.7781316  0.6087067  1.0000000   0.6908479
## curb_weight       0.9437689  0.9329988  0.9511160  0.6908479   1.0000000
## engine_size       0.8461123  0.8146846  0.9197758  0.5250273   0.9508534
## bore              0.7268964  0.6719571  0.6127596  0.5403469   0.7489882
## stroke            0.7353171  0.6755228  0.7875356  0.4844502   0.8137327
## compressio_ratio -0.7790026 -0.7171099 -0.6399021 -0.6098511  -0.7913514
## horsepower        0.8547251  0.8449430  0.9677721  0.5511329   0.9578563
## peak_rpm         -0.6286184 -0.5374890 -0.4272422 -0.4472582  -0.6035457
## highway_mpg      -0.8882941 -0.8688956 -0.8661887 -0.5856072  -0.9123514
## price             0.7563737  0.7426855  0.9208466  0.3825940   0.8702123
##                  engine_size       bore     stroke compressio_ratio horsepower
## wheel_base         0.8461123  0.7268964  0.7353171       -0.7790026  0.8547251
## length             0.8146846  0.6719571  0.6755228       -0.7171099  0.8449430
## width              0.9197758  0.6127596  0.7875356       -0.6399021  0.9677721
## height             0.5250273  0.5403469  0.4844502       -0.6098511  0.5511329
## curb_weight        0.9508534  0.7489882  0.8137327       -0.7913514  0.9578563
## engine_size        1.0000000  0.7830827  0.8658855       -0.7784063  0.9545570
## bore               0.7830827  1.0000000  0.6922929       -0.9720471  0.6390308
## stroke             0.8658855  0.6922929  1.0000000       -0.7090269  0.7991519
## compressio_ratio  -0.7784063 -0.9720471 -0.7090269        1.0000000 -0.6682294
## horsepower         0.9545570  0.6390308  0.7991519       -0.6682294  1.0000000
## peak_rpm          -0.5746084 -0.8722451 -0.6244160        0.9007901 -0.4840303
## highway_mpg       -0.8564436 -0.6408807 -0.6554019        0.6739706 -0.9055533
## price              0.8945094  0.4877970  0.8372790       -0.4969956  0.9285795
##                    peak_rpm highway_mpg      price
## wheel_base       -0.6286184  -0.8882941  0.7563737
## length           -0.5374890  -0.8688956  0.7426855
## width            -0.4272422  -0.8661887  0.9208466
## height           -0.4472582  -0.5856072  0.3825940
## curb_weight      -0.6035457  -0.9123514  0.8702123
## engine_size      -0.5746084  -0.8564436  0.8945094
## bore             -0.8722451  -0.6408807  0.4877970
## stroke           -0.6244160  -0.6554019  0.8372790
## compressio_ratio  0.9007901   0.6739706 -0.4969956
## horsepower       -0.4840303  -0.9055533  0.9285795
## peak_rpm          1.0000000   0.5240453 -0.3464962
## highway_mpg       0.5240453   1.0000000 -0.7933408
## price            -0.3464962  -0.7933408  1.0000000

In the graph shown below, we can see that the first eigenvalue of the covariance matrix obtained with the data is smaller that the one calculated with MCD. The rest are more similar.

# Compare eigenvalues of both covariance matrices
eval_S = eigen(S)$values
eval_S_MCD = eigen(S_MCD)$values

min_y = min(cbind(eval_S,eval_S_MCD)) - 1
max_y = max(cbind(eval_S,eval_S_MCD)) + 1

plot(1:13,eval_S,col="blue",type="b",xlab="Number",ylab="Eigenvalues",pch=19,
     ylim=c(min_y,max_y),main="Comparison of eigenvalues")
points(1:13,eval_S_MCD,col="red",type="b",pch=19)
legend(7,60000000, legend=c("Eigenvalues of S","Eigenvalues of S MCD"),
       col=c("Blue","Red"),lty=1,cex=1.2)

A comparison between the correlation matrixes can also help in order to know the influence of the outliers.

# Compare eigenvalues of both correlation matrices
par(mfrow=c(1,2))
corrplot(R)
corrplot(R_MCD)

In this case, we can see that there is a big difference in the correlations obtained with the data and through the MCD estimation, but it is not a reliable study as there are only 20 cars fueled by diesel and 13 predictors, so the number of observations is not large enough to draw conclusions.

A graphical analysis has been carried out in order to detect the outliers. In this graphs we are representing the Mahalanobis distances, and the outliers are graphed in green.

# Show the squared Mahalanobis distances with the final set of non_outliers
n = nrow(data_diesel)
p = ncol(data_diesel)

X_sq_Mah_MCD = MCD_est$mah
col_outliers_Mah_MCD = rep("black",n)
outliers_Mah_MCD = which(X_sq_Mah_MCD>qchisq(.99^(1/n),p))
outliers_Mah_MCD
## [1] 7
col_outliers_Mah_MCD[outliers_Mah_MCD] = "green"
par(mfrow=c(1,2))
plot(1:n,X_sq_Mah_MCD,pch=19,col=col_outliers_Mah_MCD,
     main="Squared Mahalanobis distances",xlab="Observation",ylab="Squared Mahalanobis distance")
abline(h=qchisq(.99^(1/n),p),lwd=3,col="blue")
plot(1:n,log(X_sq_Mah_MCD),pch=19,col=col_outliers_Mah_MCD,
     main="Log of squared Mahalanobis distances",
     xlab="Observation",ylab="Log of squared Mahalanobis distance")
abline(h=log(qchisq(.99^(1/n),p)),lwd=3,col="blue")

In this plot, we can see that there might be one outlier among the diesel cars.

4 Dimension reduction

4.1 Principle Component Analysis.

As we saw in the previous section, most of the variables were highly skewed, hence we are going to apply some transformations in order to make them more symmetrical as the Principle Components Analysis assumes normality in the data.

data_pca <- data_num

data_pca$wheel_base <- data_num$wheel_base
data_pca$length <- data_num$length
data_pca$width <- sqrt(data_num$width)
data_pca$height <- sqrt(data_num$height)
data_pca$curb_weight <- data_num$curb_weight
data_pca$engine_size <- log(data_num$engine_size)
data_pca$bore <- (data_num$bore)^2
data_pca$stroke <- (data_num$stroke)^2
data_pca$compressio_ratio <- log(data_num$compressio_ratio)
data_pca$horsepower <- log(data_num$horsepower)
data_pca$peak_rpm <- log(data_num$peak_rpm)
data_pca$highway_mpg <- (data_num$highway_mpg)
data_pca$price <- log(data_num$price)

We calculate the main characteristics of the transformed data.

(m = colMeans(data_pca, na.rm = T))
##       wheel_base           length            width           height 
##        98.756585       174.049268         8.117307         7.327839 
##      curb_weight      engine_size             bore           stroke 
##      2555.565854         4.800184        11.096593        10.624848 
## compressio_ratio       horsepower         peak_rpm      highway_mpg 
##         2.267389         4.577552         8.535553        30.751220 
##            price 
##         9.340520
(S_pca <- cov(data_pca))
##                    wheel_base       length        width       height
## wheel_base         36.2617824   64.9751887  0.628432373  0.590580132
## length             64.9751887  152.2086882  1.364495651  1.006668225
## width               0.6284324    1.3644957  0.017220635  0.006068115
## height              0.5905801    1.0066682  0.006068115  0.027786064
## curb_weight      2434.2967456 5638.3362004 59.281172679 25.340444097
## engine_size         1.0118760    2.5548097  0.028084803  0.005574286
## bore                5.5355847   13.7646896  0.134505851  0.069384137
## stroke              2.3826002    4.0467819  0.053054757  0.001364226
## compressio_ratio    0.3823485    0.4627282  0.005695758  0.012029935
## horsepower          0.8724088    2.5907102  0.029989803 -0.003967362
## peak_rpm           -0.2026168   -0.3439416 -0.002835638 -0.004950943
## highway_mpg       -22.5623242  -59.8680751 -0.613519821 -0.119513877
## price               1.9164296    4.7736147  0.049820214  0.015966360
##                   curb_weight   engine_size          bore        stroke
## wheel_base         2434.29675   1.011875998   5.535584702   2.382600223
## length             5638.33620   2.554809705  13.764689603   4.046781898
## width                59.28117   0.028084803   0.134505851   0.053054757
## height               25.34044   0.005574286   0.069384137   0.001364226
## curb_weight      271107.87432 128.809554691 620.468585256 210.266293895
## engine_size         128.80955   0.080069713   0.360089264   0.174499282
## bore                620.46859   0.360089264   3.487034348   0.073272214
## stroke              210.26629   0.174499282   0.073272214   4.200952379
## compressio_ratio     17.52156   0.001615863  -0.002199066   0.082792260
## horsepower          140.30672   0.078871207   0.369786629   0.051282575
## peak_rpm            -13.12866  -0.007481764  -0.055027454  -0.038049459
## highway_mpg       -2859.41736  -1.369197015  -6.766536511  -0.334482166
## price               226.46503   0.116871838   0.528439881   0.109801255
##                  compressio_ratio    horsepower      peak_rpm   highway_mpg
## wheel_base            0.382348516   0.872408843  -0.202616816 -2.256232e+01
## length                0.462728218   2.590710176  -0.343941560 -5.986808e+01
## width                 0.005695758   0.029989803  -0.002835638 -6.135198e-01
## height                0.012029935  -0.003967362  -0.004950943 -1.195139e-01
## curb_weight          17.521555536 140.306720084 -13.128664344 -2.859417e+03
## engine_size           0.001615863   0.078871207  -0.007481764 -1.369197e+00
## bore                 -0.002199066   0.369786629  -0.055027454 -6.766537e+00
## stroke                0.082792260   0.051282575  -0.038049459 -3.344822e-01
## compressio_ratio      0.079656005  -0.025304521  -0.011303940  5.789827e-01
## horsepower           -0.025304521   0.123294450   0.005519330 -2.016952e+00
## peak_rpm             -0.011303940   0.005519330   0.009203295 -3.338861e-02
## highway_mpg           0.578982748  -2.016951545  -0.033388613  4.742310e+01
## price                 0.009837654   0.137505926  -0.005619887 -2.649051e+00
##                          price
## wheel_base         1.916429581
## length             4.773614667
## width              0.049820214
## height             0.015966360
## curb_weight      226.465034900
## engine_size        0.116871838
## bore               0.528439881
## stroke             0.109801255
## compressio_ratio   0.009837654
## horsepower         0.137505926
## peak_rpm          -0.005619887
## highway_mpg       -2.649050854
## price              0.252050249
(R_pca <- cor(data_pca))
##                  wheel_base     length      width       height curb_weight
## wheel_base        1.0000000  0.8745875  0.7952605  0.588356757   0.7763863
## length            0.8745875  1.0000000  0.8428065  0.489500484   0.8777285
## width             0.7952605  0.8428065  1.0000000  0.277405927   0.8676032
## height            0.5883568  0.4895005  0.2774059  1.000000000   0.2919642
## curb_weight       0.7763863  0.8777285  0.8676032  0.291964226   1.0000000
## engine_size       0.5938388  0.7318207  0.7563323  0.118179340   0.8742646
## bore              0.4922784  0.5974734  0.5488940  0.222904360   0.6381468
## stroke            0.1930424  0.1600355  0.1972540  0.003992996   0.1970265
## compressio_ratio  0.2249705  0.1328914  0.1537863  0.255705750   0.1192319
## horsepower        0.4125948  0.5980360  0.6508445 -0.067782347   0.7674245
## peak_rpm         -0.3507351 -0.2905984 -0.2252447 -0.309601287  -0.2628317
## highway_mpg      -0.5440819 -0.7046616 -0.6789051 -0.104114173  -0.7974648
## price             0.6339058  0.7706977  0.7562014  0.190787030   0.8663363
##                  engine_size         bore       stroke compressio_ratio
## wheel_base        0.59383882  0.492278438  0.193042407      0.224970501
## length            0.73182072  0.597473418  0.160035451      0.132891438
## width             0.75633235  0.548894003  0.197254026      0.153786326
## height            0.11817934  0.222904360  0.003992996      0.255705750
## curb_weight       0.87426457  0.638146834  0.197026533      0.119231864
## engine_size       1.00000000  0.681471858  0.300874672      0.020233044
## bore              0.68147186  1.000000000  0.019144205     -0.004172545
## stroke            0.30087467  0.019144205  1.000000000      0.143122067
## compressio_ratio  0.02023304 -0.004172545  0.143122067      1.000000000
## horsepower        0.79380273  0.563963928  0.071256450     -0.255338964
## peak_rpm         -0.27561204 -0.307170596 -0.193509791     -0.417492715
## highway_mpg      -0.70264643 -0.526190922 -0.023697587      0.297893754
## price             0.82268234  0.563668685  0.106706251      0.069428650
##                   horsepower    peak_rpm highway_mpg       price
## wheel_base        0.41259477 -0.35073511 -0.54408192  0.63390580
## length            0.59803602 -0.29059843 -0.70466160  0.77069772
## width             0.65084454 -0.22524466 -0.67890508  0.75620135
## height           -0.06778235 -0.30960129 -0.10411417  0.19078703
## curb_weight       0.76742447 -0.26283174 -0.79746479  0.86633633
## engine_size       0.79380273 -0.27561204 -0.70264643  0.82268234
## bore              0.56396393 -0.30717060 -0.52619092  0.56366869
## stroke            0.07125645 -0.19350979 -0.02369759  0.10670625
## compressio_ratio -0.25533896 -0.41749271  0.29789375  0.06942865
## horsepower        1.00000000  0.16384867 -0.83412039  0.78002062
## peak_rpm          0.16384867  1.00000000 -0.05053959 -0.11668428
## highway_mpg      -0.83412039 -0.05053959  1.00000000 -0.76621697
## price             0.78002062 -0.11668428 -0.76621697  1.00000000

We can see that the scale of variances is very different, so we are calculating PCA based on the correlation matrix.

p1 <- prcomp(data_pca, scale = TRUE)
# Explained variability
fviz_eig(p1,addlabels = T,ylim=c(0,60))

We can see that with the first 3 components, we are able to explain 77.9% of the variance of the original data. Therefore, we are only representing graphically these three components.

The correlation between the original data and the first three components is calculated in order to explain the components and how they behave. Moreover, this correlations will be shown graphically with the help of biplots.

cor(data_pca, p1$x[,1:3])
##                          PC1           PC2         PC3
## wheel_base        0.82102846 -0.3545504773  0.18141494
## length            0.92073283 -0.1653848829  0.14575839
## width             0.89670250 -0.0669287444 -0.02436096
## height            0.34509406 -0.5959076626  0.51992260
## curb_weight       0.96848435  0.0003158146 -0.04013278
## engine_size       0.89590331  0.1053649652 -0.26254576
## bore              0.71474702  0.0145072936  0.06503974
## stroke            0.20198529 -0.2307933271 -0.80112039
## compressio_ratio  0.07163726 -0.7508055161 -0.16157392
## horsepower        0.78632884  0.5342303795 -0.07861467
## peak_rpm         -0.26932926  0.7036507788  0.16706329
## highway_mpg      -0.81851178 -0.4229832350 -0.10574682
## price             0.89061238  0.1432471789 -0.02187952
par(mfrow = c(1,3))
biplot(p1, main="PC1 vs PC2", cex=0.7)
biplot(p1, choices=c(1,3), main="PC1 vs PC3", cex=0.7) 
biplot(p1, choices=c(3,2), main="PC2 vs PC3", cex=0.7)

The first component has high positive correlations with the variables wheel_base, length, width, curb_weight, engine_size, bore, horsepower and price but it has a negative correlation with the highway_mpg. This means that the big, non environmentally friendly cars with a lot of power will be on the right side and the smaller cars with less consumption will be on the left.

The second component has a negative correlation with the height and compressio_ratio and a positive correlation with peak_rpm. This means that the cars which are tall and more efficient will be at the bottom of the graph while the shorter and less efficient cars will be at the top (PC1 vs PC2).

The third component has a negative correlation with stroke, which indicates the engine's displacement. The cars on the right side will have a longer engine displacement (PC3 vs PC2). Moreover, we can see that there is also positive correlation with the variable height, so the taller cars will be displayed on the left side of the biplot (PC3 vs PC2).

The representation of the first 3 components is shown below.

color_1 <- "deepskyblue2"
color_2 <- "darkorchid2"
colors_X <- c(color_1,color_2)[1*(data$fuel_type=="gas")+1]
pairs(p1$x[,1:3],col=colors_X,pch=19,main="The first three PCs")

We can see that the \(2^{nd}\) component is the most useful to identify both groups, being the cars fueled by diesel in blue and the cars fueled by gas in purple. Therefore, we can say that the diesel cars are usually shorter and more efficient and the gas cars are taller and less efficient. This conclusion verifies the assumptions made in the previous sections.

4.2 Independent component analysis

In this case, we are going to transform the numeric data linearly in order to obtain a new matrix, Z, with maximally independent and non-Gaussian variables.

The way this is achieved is by maximizing the negative entropy.

# Matrix dimension
n <- nrow(data_pca)
p <- ncol(data_pca)

# ICs after data transformation
X_trans_ica <- ica::icafast(data_pca,nc=p,alg="par")

# The ICA scores
Z <- X_trans_ica$S
colnames(Z) <- sprintf("IC-%d",seq(1,13))

# Re-scale to have S_z=I 
Z <- Z * sqrt((n-1)/n)

# Histograms of the ICA scores obtained
par(mar=c(1,1,1,1))
par(mfrow=c(4,4))
sapply(colnames(Z),function(cname){hist(as.data.frame(Z)[[cname]],
                                         main=cname,col=color_1,xlab="")})
##          IC-1                        IC-2                       
## breaks   numeric,11                  numeric,10                 
## counts   integer,10                  integer,9                  
## density  numeric,10                  numeric,9                  
## mids     numeric,10                  numeric,9                  
## xname    "as.data.frame(Z)[[cname]]" "as.data.frame(Z)[[cname]]"
## equidist TRUE                        TRUE                       
##          IC-3                        IC-4                       
## breaks   numeric,10                  numeric,9                  
## counts   integer,9                   integer,8                  
## density  numeric,9                   numeric,8                  
## mids     numeric,9                   numeric,8                  
## xname    "as.data.frame(Z)[[cname]]" "as.data.frame(Z)[[cname]]"
## equidist TRUE                        TRUE                       
##          IC-5                        IC-6                       
## breaks   numeric,8                   numeric,13                 
## counts   integer,7                   integer,12                 
## density  numeric,7                   numeric,12                 
## mids     numeric,7                   numeric,12                 
## xname    "as.data.frame(Z)[[cname]]" "as.data.frame(Z)[[cname]]"
## equidist TRUE                        TRUE                       
##          IC-7                        IC-8                       
## breaks   numeric,11                  numeric,10                 
## counts   integer,10                  integer,9                  
## density  numeric,10                  numeric,9                  
## mids     numeric,10                  numeric,9                  
## xname    "as.data.frame(Z)[[cname]]" "as.data.frame(Z)[[cname]]"
## equidist TRUE                        TRUE                       
##          IC-9                        IC-10                      
## breaks   numeric,10                  numeric,9                  
## counts   integer,9                   integer,8                  
## density  numeric,9                   numeric,8                  
## mids     numeric,9                   numeric,8                  
## xname    "as.data.frame(Z)[[cname]]" "as.data.frame(Z)[[cname]]"
## equidist TRUE                        TRUE                       
##          IC-11                       IC-12                      
## breaks   numeric,11                  numeric,9                  
## counts   integer,10                  integer,8                  
## density  numeric,10                  numeric,8                  
## mids     numeric,10                  numeric,8                  
## xname    "as.data.frame(Z)[[cname]]" "as.data.frame(Z)[[cname]]"
## equidist TRUE                        TRUE                       
##          IC-13                      
## breaks   numeric,11                 
## counts   integer,10                 
## density  numeric,10                 
## mids     numeric,10                 
## xname    "as.data.frame(Z)[[cname]]"
## equidist TRUE
# Compute the neg-entropy of the columns in Z and sort them in decreasing order of neg-entropy

neg_entropy <- function(z){1/12 * mean(z^3)^2 + 1/48 * mean(z^4)^2}
Z_neg_entropy <- apply(Z,2,neg_entropy)
ic_sort <- sort(Z_neg_entropy,decreasing=TRUE,index.return=TRUE)$ix
ic_sort
##  [1]  6 11 13  8  3  7  4  9  2 10 12  5  1
# Plot the sorted neg-entropy and define the ICs sorted by neg-entropy
par(mfrow=c(1,1))

plot(Z_neg_entropy[ic_sort],type="b",col=color_1,pch=19,
     ylab="Neg-entropy",main="Neg-entropies",lwd=3)

After plotting, the negative entropies we can see that there are 4 ICs larger than the other ones.

set.seed(10040485)
Z_ic_imp <- Z[,ic_sort]

# Plot the two ICs with largest neg-entropy
par(mfrow=c(1,1))
plot(Z_ic_imp[,1:2],pch=19,col=colors_X,
     xlab="First IC scores",ylab="Second IC scores",main="First two IC scores")

# Let's identify the points with large negative values of the first IC

which(Z_ic_imp[,1]<(-1))
##  [1]  11  12  13  14  15  17  19  41  66  67  98  99 131 132 155 163 205
# Let's identify the points with large positive values of the second IC

which(Z_ic_imp[,2]>(6))
## [1] 156

With the first two scores, its is pointed out that some cars fueled by gas are outliers.

In order to see what IC can show the differentiation between two groups, the scatterplot matrix with the highest and lowest neg-entropy will be plotted.

# Plot the ICs with the highest and lowest entropy
pairs(Z_ic_imp[,c(1:5,12:13)],col=colors_X,pch=19,
      main="ICs with the highest and lowest entropy")

plot(Z_ic_imp[,12:13],pch=19,col=colors_X,
     xlab="5-th IC scores",ylab="1-th IC scores",main="Last two IC scores")

In often cases, the differentiation between two groups can be seen in the pairs with the lowest negative entropy as it is where the minimum values of the kurtosis are attained. However, as it is shown in both graphs, we do not obtained a clear separation in this case.

plot(Z_ic_imp[,6:7],pch=19,col=colors_X,
      xlab="7-th IC scores",ylab="4-th IC scores")

Nevertheless, looking at the plot of the ICs we can see that the seventh IC had two different groups, so we decided to check them and we obtained that the formed groups are differentiated by the variable fuel_type.

The correlation between the data and the ICs is calculated:

# Correlation data vs ICs
corrplot(cor(data_pca,Z_ic_imp),is.corr=T)

In this case, the IC7 is highly positively correlated with the compressio-ratio and that is why this IC is able to differentiate between diesel and gas cars. On the other hand, the IC3 is negatively correlated with highway_mpg.

Finally, we check the correlation between the PCA components and the ICs.

# Correlation PCA vs ICs
corrplot(cor(prcomp(data_pca)$x,Z),is.corr=T)

As it can be seen, the correlations are not very high for the most part, which is the most common situation. However, there are some that could be highlighted, for example the negative correlation between the IC4 and PC13 and the positive correlation between IC7 and PC3.

4.3 Factor analysis

r <- 3
data_pca<-scale(data_pca)
p<-ncol(data_pca)
# Initial estimation of M and Sigma_nu
M_0 <- p1$rotation[,1:r] %*% diag(p1$sdev[1:r])
M_0
##                         [,1]          [,2]        [,3]
## wheel_base        0.82102846 -0.3545504773  0.18141494
## length            0.92073283 -0.1653848829  0.14575839
## width             0.89670250 -0.0669287444 -0.02436096
## height            0.34509406 -0.5959076626  0.51992260
## curb_weight       0.96848435  0.0003158146 -0.04013278
## engine_size       0.89590331  0.1053649652 -0.26254576
## bore              0.71474702  0.0145072936  0.06503974
## stroke            0.20198529 -0.2307933271 -0.80112039
## compressio_ratio  0.07163726 -0.7508055161 -0.16157392
## horsepower        0.78632884  0.5342303795 -0.07861467
## peak_rpm         -0.26932926  0.7036507788  0.16706329
## highway_mpg      -0.81851178 -0.4229832350 -0.10574682
## price             0.89061238  0.1432471789 -0.02187952
S_y <- cov(data_pca)
Sigma_nu_0 <- diag(diag(S_y - M_0 %*% t(M_0)))
Sigma_nu_0
##            [,1]      [,2]      [,3]      [,4]       [,5]      [,6]      [,7]
##  [1,] 0.1672948 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
##  [2,] 0.0000000 0.1036534 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
##  [3,] 0.0000000 0.0000000 0.1908517 0.0000000 0.00000000 0.0000000 0.0000000
##  [4,] 0.0000000 0.0000000 0.0000000 0.2554846 0.00000000 0.0000000 0.0000000
##  [5,] 0.0000000 0.0000000 0.0000000 0.0000000 0.06042732 0.0000000 0.0000000
##  [6,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.1173252 0.0000000
##  [7,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.4846961
##  [8,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
##  [9,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [10,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [11,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [12,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [13,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
##            [,8]     [,9]     [,10]     [,11]     [,12]     [,13]
##  [1,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
##  [2,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
##  [3,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
##  [4,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
##  [5,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
##  [6,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
##  [7,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
##  [8,] 0.2641425 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
##  [9,] 0.0000000 0.405053 0.0000000 0.0000000 0.0000000 0.0000000
## [10,] 0.0000000 0.000000 0.0901046 0.0000000 0.0000000 0.0000000
## [11,] 0.0000000 0.000000 0.0000000 0.4044272 0.0000000 0.0000000
## [12,] 0.0000000 0.000000 0.0000000 0.0000000 0.1399413 0.0000000
## [13,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.1858111
# Estimation of M without varimax rotation
MM <- S_y - Sigma_nu_0
MM_eig <- eigen(MM)
MM_values <- MM_eig$values
MM_vectors <- MM_eig$vectors
M_1 <- MM_eig$vectors[,1:r] %*% diag(MM_eig$values[1:r])^(1/2)
M_1
##              [,1]         [,2]        [,3]
##  [1,] -0.80986466 -0.369649467  0.16327782
##  [2,] -0.91728681 -0.186421820  0.14566568
##  [3,] -0.88244859 -0.076442188 -0.02047855
##  [4,] -0.33453838 -0.583821992  0.42740464
##  [5,] -0.97155623 -0.007703445 -0.04059021
##  [6,] -0.89102389  0.106396574 -0.26576796
##  [7,] -0.67211144  0.009007158  0.03476317
##  [8,] -0.19643762 -0.212739965 -0.70063443
##  [9,] -0.06649663 -0.644866889 -0.14944683
## [10,] -0.78695521  0.542982522 -0.05653694
## [11,]  0.25383158  0.607354461  0.15938112
## [12,]  0.81294693 -0.409931274 -0.12999536
## [13,] -0.87767925  0.134498062 -0.01258802
# Final estimation of M and Sigma_nu after varimax rotation for interpretability
M <- varimax(M_1)
M <- loadings(M)[1:p,1:r]
M
##              [,1]        [,2]        [,3]
##  [1,] -0.70611902 -0.56387369  0.05122042
##  [2,] -0.85279224 -0.40924838  0.05145709
##  [3,] -0.83839551 -0.27032346 -0.09484986
##  [4,] -0.20429325 -0.70235815  0.31685338
##  [5,] -0.94030964 -0.22132060 -0.11163962
##  [6,] -0.88023246 -0.05911946 -0.31236535
##  [7,] -0.65627156 -0.14885392 -0.01317384
##  [8,] -0.11398373 -0.14281393 -0.73576255
##  [9,]  0.09251637 -0.61280540 -0.24190849
## [10,] -0.88965830  0.35273277 -0.03758526
## [11,]  0.09786681  0.61764037  0.26010520
## [12,]  0.89082893 -0.19051935 -0.12629866
## [13,] -0.88358029 -0.06749423 -0.05744083
Sigma_nu <- diag(diag(S_y - M %*% t(M)))
Sigma_nu
##            [,1]      [,2]      [,3]      [,4]       [,5]      [,6]      [,7]
##  [1,] 0.1808189 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
##  [2,] 0.0000000 0.1026133 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
##  [3,] 0.0000000 0.0000000 0.2150217 0.0000000 0.00000000 0.0000000 0.0000000
##  [4,] 0.0000000 0.0000000 0.0000000 0.3645612 0.00000000 0.0000000 0.0000000
##  [5,] 0.0000000 0.0000000 0.0000000 0.0000000 0.05437158 0.0000000 0.0000000
##  [6,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.1241236 0.0000000
##  [7,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.5469766
##  [8,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
##  [9,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [10,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [11,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [12,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [13,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
##            [,8]      [,9]      [,10]     [,11]     [,12]    [,13]
##  [1,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
##  [2,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
##  [3,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
##  [4,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
##  [5,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
##  [6,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
##  [7,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
##  [8,] 0.4252654 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
##  [9,] 0.0000000 0.5573905 0.00000000 0.0000000 0.0000000 0.000000
## [10,] 0.0000000 0.0000000 0.08267505 0.0000000 0.0000000 0.000000
## [11,] 0.0000000 0.0000000 0.00000000 0.5412877 0.0000000 0.000000
## [12,] 0.0000000 0.0000000 0.00000000 0.0000000 0.1541749 0.000000
## [13,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.211431
# Understanding the relationship between observable variables and factors
# First factor 
plot(1:p,M[,1],pch=19,col=color_1,xlab="",ylab="Loadings",main="Loadings for the first factor")
abline(h=0)
text(1:p,M[,1],labels=colnames(data_pca),pos=1,col=color_2,cex=0.75)

# Second factor 
## 4 is height and 9 is compressio_ratio
plot(1:p,M[,2],pch=19,col=color_1,xlab="",ylab="Loadings",main="Loadings for the second factor")
abline(h=0)
text(1:p,M[,2],labels=colnames(data_pca),pos=1,col=color_2,cex=0.75)

# Third factor  
# 8 is stroke
plot(1:p,M[,3],pch=19,col=color_1,xlab="",ylab="Loadings",main="Loadings for the third factor")
abline(h=0)
text(1:p,M[,3],labels=colnames(data_pca),pos=1,col=color_2,cex=0.75)

# Communalities
comM <- diag(M %*% t(M))
comM
##  [1] 0.8191811 0.8973867 0.7849783 0.6354388 0.9456284 0.8758764 0.4530234
##  [8] 0.5747346 0.4426095 0.9173250 0.4587123 0.8458251 0.7885690
names(comM) <- colnames(data_pca)
plot(1:p,sort(comM,decreasing=TRUE),pch=20,col=color_1,xlim=c(0,15),
     xlab="Variables",ylab="Communalities",main="Communalities")
text(1:p,sort(comM,decreasing=TRUE),labels=names(sort(comM,decreasing=TRUE)),
     pos=4,col=color_2,cex=0.75)

# Estimate the factor scores
Fact <- data_pca %*% solve(Sigma_nu) %*% M %*% solve(t(M) %*% solve(Sigma_nu) %*% M)
colnames(Fact) <- c("Factor 1","Factor 2","Factor 3")

# See that the factors are uncorrelated
pairs(Fact,pch=19,col=color_1)

corrplot(cor(Fact),order="hclust")

# Estimate the residuals

Nu <- data_pca - Fact %*% t(M)
corrplot(cor(Nu),order="hclust")

Watching the plots that represent the relationship between variables and factors, we obtain that the first factor symbolizes the smallness of the cars as the cars with higher values are the least large, least powerful and with less consumption in highgway (most miles per gallon). The second factor is an index of power and poor efficiency (compression_ratio, variable number 9, means efficiency). The third factor symbolizes tall cars with a short engine displacement.

If we analyze the interpretations of the biplots in the PCA and the interpretations of the factor analysis we find out that the relationship between the principal components and the variables is the same as the relationship of variables and factor but in the case of the first elements they present the opposite signs: if the PC1 is related with big and powerful cars, the first factor is an index of small and less powerful cars.

Then, in the communalities plot we can see that the variables that are better explained by the factors are curb_weight, length, horsepower, engine_size, highway_mpg and wheel_base and clearly the worst explained one is compressio_ratio. Afterwards, we compute the estimations of the factors' scores and with the plots we check that there is practically no correlation between them. Finally, with the representation of the correlation matrix of the residuals we find out that there exists some correlations that the factor model is not able to explain.

5 Unsupervised classification

5.1 Partitional clustering

pairs(p1$x[,1:2],pch=19,main="The first two PCs")

These last two graphs are tricky. When the first component is represented in the x-axis (plot on the bottom left) it does not seem that there is an obvious differentiation between two groups. On the other hand, when the second component is on the x-axis it may look like there is an obvious distinctness between the group on the left with few components and the group on the right with a larger number of instances. Hereafter, it is explained how the clustering was carried out.

5.1.1 K-means

First of all, it has to be pointed out that for this methods it is necessary to standarize the variables.

When we use the 'Within-cluster sum of squares' as a criteria to determine the optimal number of clusters, the result obtained is not clear at all as there is no evident 'elbow' in the plot.

color_1 <- "deepskyblue2"
color_2 <- "darkorchid4"
color_3 <- "seagreen2"
color_4 <- "orange2"
data_nume<-scale(data_num)
fviz_nbclust(data_nume,kmeans,method="wss",k.max=10)

Therefore, we proceed to determine the optimal value of 'k' with the 'Average silhouette'. We obtained that the best option was using k=2 as it presented the highest average as it can be seen in the plot below.

fviz_nbclust(data_nume,kmeans,method="silhouette",k.max=10)

Then we carry out the k-means approach and we represent the observations with the color of their respective group. We can observe that there is a cluster on the right with fewer instances than the one located on the left. Through the silhouette plot we can denote that all of the observations (except one) belong to the cluster whose mean is closer to that observation (because all the values are positive). That is a good indicative as the negative value means that the observation should be part of the other group because it is closer to the other mean. We also obtained that the average silhouette width for this method is 0.33.

kmeans_X <- kmeans(data_nume,centers=2,iter.max=1000,nstart=100)
colors_kmeans_X <- c(color_1,color_2)[kmeans_X$cluster]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_kmeans_X,xlab="First PC",ylab="Second PC")
sil_kmeans_X <- silhouette(kmeans_X$cluster,dist(data_nume,"euclidean"))
plot(sil_kmeans_X,col=color_1)

5.1.2 K-medoids

Another possibility for clustering is the k-medoids clustering which instead of means, uses medoids (center observation of the cluster) and instead of the the Euclidean distance it uses the Manhattan (or Gower) distance. There are different algorithms in the k-medoid method:

5.1.2.1 PAM (I)

PAM (Partition Around Medoids) is the standard k-medoids algorithm. The resultant clustering gives a more balanced partition as it can be seen in the first plot. But even if the average silhouette (0.35) is larger than the previous one, this option is worse as there are more negative values and therefore more wrongly assigned observations.

pam_X <- pam(data_nume,k=2,metric="manhattan",stand=FALSE)
colors_pam_X <- c(color_1,color_2)[pam_X$cluster]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_pam_X,xlab="First PC",ylab="Second PC")
sil_pam_X <- silhouette(pam_X$cluster,dist(data_nume,method="manhattan"))
plot(sil_pam_X,col=color_1)

5.1.2.2 PAM (II)

Another option is to use the Gower distance with the PAM algorithm. We have obtained that adding the non-numerical variables as binary variables (through one-hot encoding) results on tacking on a lot of noise and therefore the result is less likeable. That is the reason why we use just the numeric data.

#Just numeric data
X_Gower <- daisy(data_nume,metric="gower")
summary(X_Gower)
## 20910 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1290  0.1849  0.1983  0.2507  0.6625 
## Metric :  mixed ;  Types = I, I, I, I, I, I, I, I, I, I, I, I, I 
## Number of objects : 205
X_Gower_mat <- as.matrix(X_Gower)
X_K <- matrix(NA,nrow=1,ncol=19)
for (i in 1:19){
  pam_X_Gower_mat <- pam(X_Gower_mat,k=i+1,diss=TRUE)
  X_K[i] <- pam_X_Gower_mat$silinfo$avg.width
}
plot(2:20,X_K,pch=19,col="deepskyblue2",xlab="Number of clusters",ylab="Average silhouette")

pam_X_Gower_mat <- pam(X_Gower_mat,k=2,diss=TRUE)
colors_pam_2 <- c(color_1,color_2)[pam_X_Gower_mat$cluster]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_pam_2,xlab="First PC",ylab="Second PC")
sil_pam_X_Gower_mat <- silhouette(pam_X_Gower_mat$cluster,X_Gower_mat)
plot(sil_pam_X_Gower_mat,col=color_1)

summary(sil_pam_X_Gower_mat)
## Silhouette of 205 units in 2 clusters from silhouette.default(x = pam_X_Gower_mat$cluster, dist = X_Gower_mat) :
##  Cluster sizes and average silhouette widths:
##       112        93 
## 0.4635295 0.2064305 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2262  0.2323  0.3676  0.3469  0.5131  0.6225

We get a similar result as the one obtained with the Manhattan distance: there are some negative values in the silhouette graph and therefore we still prefer the k-means method.

5.1.2.2.1 Adding factors
data$symboling<-as.factor(data$symboling)
data_cat<-data[,c(1:8,14,15,17)]
data_cat <- one_hot(as.data.table(data_cat))
data_e<-cbind(data_nume,data_cat)
X_Gower <- daisy(data_e,metric="gower")
summary(X_Gower)
## 20910 dissimilarities, summarized :
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000144 0.128190 0.169110 0.169360 0.209430 0.351850 
## Metric :  mixed ;  Types = I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I 
## Number of objects : 205
X_Gower_mat <- as.matrix(X_Gower)
X_K <- matrix(NA,nrow=1,ncol=19)
for (i in 1:19){
  pam_X_Gower_mat <- pam(X_Gower_mat,k=i+1,diss=TRUE)
  X_K[i] <- pam_X_Gower_mat$silinfo$avg.width
}
plot(2:20,X_K,pch=19,col="deepskyblue2",xlab="Number of clusters",ylab="Average silhouette")

which.max(X_K)+1
## [1] 20

5.1.2.3 CLARA

The aim of CLARA (Clustering for Large Applications) is computational efficiency by running PAM to random sub-samples, not to improve PAM results. In the Silhouette plot we found that the average silhoutte is slightly larger than the previous ones but there are still some negative values, that is why we still prefering the k-means solution.

clara_X <- clara(data_nume,k=2,metric="manhattan",stand=FALSE)
colors_clara_X <- c(color_1,color_2)[clara_X$cluster]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_clara_X,xlab="First PC",ylab="Second PC")
sil_clara_X <- silhouette(clara_X$cluster,dist(data_nume,method="manhattan"))
plot(sil_clara_X,col=color_1)

5.2 Hierarchical

These are methods that do not require to fix K as they explore cluster solutions for every possible value of k. As it is possible to use mixed data (using the Gower distance) we have tried different options to find out if in our case it was better to use only numeric data or the mixed dataset.

5.2.1 Aglomerative

In these cases, the algorithms start from k = n and then merges clusters based on the distances between them. There are different methods to compute the distance between a new cluster and the other clusters, these are called 'linkage methods' and are the following ones:

man_dist_X <- daisy(data_nume,metric="manhattan",stand=F)
gower_dist <- daisy(data_e,metric="gower",stand=F)
## Warning in daisy(data_e, metric = "gower", stand = F): binary variable(s) 14,
## 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
## 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
## 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
## 75, 76, 77, 78, 79 treated as interval scaled

5.2.1.1 Single linkage

5.2.1.1.1 Manhattan distance (numeric data)
single_X <- hclust(man_dist_X,method="single")
plot(single_X,main="Single linkage",cex=0.8)
rect.hclust(single_X,k=2,border=color_1)

cl_single_X <- cutree(single_X,2)
colors_single_X <- c(color_1,color_2)[cl_single_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_single_X, xlab="First PC",ylab="Second PC")
sil_single_X <- silhouette(cl_single_X,man_dist_X)
plot(sil_single_X,col=color_1)

5.2.1.1.2 Gower distance (mixed data)
single_X <- hclust(gower_dist,method="single")
plot(single_X,main="Single linkage",cex=0.8)
rect.hclust(single_X,k=2,border=color_1)

cl_single_X <- cutree(single_X,2)
colors_single_X <- c(color_1,color_2)[cl_single_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_single_X, xlab="First PC",ylab="Second PC")
sil_single_X <- silhouette(cl_single_X,man_dist_X)
plot(sil_single_X,col=color_1)

Normally this linkage method tends to create large clusters and this is what we have obtained with both approaches. The results are quite awful as we can see in both the representation of the dendogram does not look right and the plot of the Silhouette shows that there are a lot of observations wrongly assinged (negative values). These were not a good option.

5.2.1.2 Complete

5.2.1.2.1 Manhattan distance (numeric data)
complete_X <- hclust(man_dist_X,method="complete")
plot(complete_X,main="Complete linkage",cex=0.8)
rect.hclust(complete_X,k=2,border=color_1)

cl_complete_X <- cutree(complete_X,2)
colors_complete_X <- c(color_1,color_2)[cl_complete_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_complete_X, xlab="First PC",ylab="Second PC")
sil_complete_X <- silhouette(cl_complete_X,man_dist_X)
plot(sil_complete_X,col=color_1)

In this case the dendogram looks better. As it was expected the groups are a little bit more balanced (still imbalanced) and we obtained the Silhouette still presents more negative values than the k-means method.

5.2.1.2.2 Gower distance (mixed data)
complete_X <- hclust(gower_dist,method="complete")
plot(complete_X,main="Complete linkage",cex=0.8)
rect.hclust(complete_X,k=2,border=color_1)

cl_complete_X <- cutree(complete_X,2)
colors_complete_X <- c(color_1,color_2)[cl_complete_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_complete_X, xlab="First PC",ylab="Second PC")
sil_complete_X <- silhouette(cl_complete_X,man_dist_X)
plot(sil_complete_X,col=color_1)

Using the Gower distance the groups are also imbalanced as it can be seen in the dendogram and in the Silhouette even with a smaller average Silhouette, the results are similar to the previous approach. The conclusion is the same, k-means is still a better option.

5.2.1.3 Average

This method was expected to give an intermediate solution between the single and the complete linkage methods, and this is what we have seen in the plots.

5.2.1.3.1 Manhattan distance (numeric data)
average_X <- hclust(man_dist_X,method="average")
plot(average_X,main="Average linkage",cex=0.8)
rect.hclust(average_X,k=2,border=color_1)

cl_average_X <- cutree(average_X,2)
colors_average_X <- c(color_1,color_2)[cl_average_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_average_X, xlab="First PC",ylab="Second PC")
sil_average_X <- silhouette(cl_average_X,man_dist_X)
plot(sil_average_X,col=color_1)

In this approach we have a more imbalanced result and eventhough the average silhouette is the largest, there are more negative values, what makes it a worse method.

5.2.1.3.2 Gower distance (mixed data)
average_X <- hclust(gower_dist,method="average")
plot(average_X,main="Average linkage",cex=0.8)
rect.hclust(average_X,k=2,border=color_1)

cl_average_X <- cutree(average_X,2)
colors_average_X <- c(color_1,color_2)[cl_average_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_average_X, xlab="First PC",ylab="Second PC")
sil_average_X <- silhouette(cl_average_X,man_dist_X)
plot(sil_average_X,col=color_1)

With the Gower distance and the mixed dataset the clusters are more imbalanced and the result is even worse as there are more wrongly assigned observations.

5.2.1.4 Ward

5.2.1.4.1 Manhattan distance (numeric data)
ward_X <- hclust(man_dist_X,method="ward")
plot(ward_X,main="Ward linkage",cex=0.8)
rect.hclust(ward_X,k=2,border=color_1)

cl_ward_X <- cutree(ward_X,2)
colors_ward_X <- c(color_1,color_2)[cl_ward_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_ward_X,xlab="First PC",ylab="Second PC")
sil_ward_X <- silhouette(cl_ward_X,man_dist_X)
plot(sil_ward_X,col=color_1)

5.2.1.4.2 Gower distance (mixed data)
ward_X <- hclust(gower_dist,method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
plot(ward_X,main="Ward linkage",cex=0.8)
rect.hclust(ward_X,k=2,border=color_1)

cl_ward_X <- cutree(ward_X,2)
colors_ward_X <- c(color_1,color_2)[cl_ward_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_ward_X,xlab="First PC",ylab="Second PC")
sil_ward_X <- silhouette(cl_ward_X,man_dist_X)
plot(sil_ward_X,col=color_1)

As it was expected, the ward linkage method provides similar results to the k-means method. More balanced groups than the other linkage methods are found but the Silhouette plots show that there are more observations in clusters that they do not belong to than with the k-means method. Therefore, we still prefer the k-means approach.

An interesting point found out in this section is that in general with the Gower distance, lower average silhouettes are obtained (that might be bacuse the group means are closer with this distance) and more observations are wrongly assigned. Generally Manhattan distance finds better solutions.

5.2.2 Divisive (DIANA)

In the Divisive Analysis Clustering (DIANA) all the observations belong to a single cluster, and at each step the cluster with the largest diameter is divided into two clusters and so on until there is a cluster for each observation.

diana_X <- diana(data_nume,metric="manhattan")
par(mfrow=c(1,2))
plot(diana_X,main="DIANA")
# The heights here are the diameters of the clusters before splitting
rect.hclust(diana_X,k=2,border=color_1)

cl_diana_X <- cutree(diana_X,2)
colors_diana_X <- c(color_1,color_2)[cl_diana_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_diana_X,xlab="First PC",ylab="Second PC")
sil_diana_X <- silhouette(cl_diana_X,man_dist_X)
plot(sil_diana_X,col=color_1)

This approach presents the best results until this moment. With the dendogram and the PC's plot we can see that the groups are balanced (as in k-means) and in the Silhouette plot no negative value is present, therefore this solution is the best one yet.

5.3 Model based (M-CLUST)

This approach does not use distances but probability to obtain results. Asuming that the observations are generated by different distributions with certain probabilities, these observations are assigned to different clusters according to the Bayes Theorem. M-clust is the most widely used model-based clustering method and it works like this:

BIC_X <- mclustBIC(p1$x[,1:2],G=1:5)
plot(BIC_X)

Mclust_X <- Mclust(p1$x[,1:2],x=BIC_X)
summary(Mclust_X)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VVI (diagonal, varying volume and shape) model with 3 components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -803.0912 205 14 -1680.704 -1712.688
## 
## Clustering table:
##   1   2   3 
##  45 142  18
# Plot the groups
par(mfrow=c(1,2))
plot(Mclust_X,what="classification")
# Plot the estimated densities
plot(Mclust_X,what="density")

# Plot the estimated probabilities of the observations for each cluster
colors_Mclust_X <- c(color_1,color_2,color_3)[Mclust_X$classification]
par(mfrow=c(2,2))
plot(1:nrow(data_num),Mclust_X$z[,1],pch=19,col=colors_Mclust_X,main="Cluster 1",
     xlab="Cancer cell",ylab="Probability of cluster 1")
plot(1:nrow(data_num),Mclust_X$z[,2],pch=19,col=colors_Mclust_X,main="Cluster 2",
     xlab="Cancer cell",ylab="Probability of cluster 2")
plot(1:nrow(data_num),Mclust_X$z[,3],pch=19,col=colors_Mclust_X,main="Cluster 3",
     xlab="Cancer cell",ylab="Probability of cluster 3")

# Plot the points with uncertainty
#par(mfrow=c(1,1))
plot(Mclust_X,what="uncertainty")

In the first plot we can see that the model with the best results is the model with 3 clusters in which the covariance matrices are diagonal, have different volume and shape (VVI). With the two following graphs we can observe the distribution of the clusters and their diagonal densities. In the last plots the probabilities of belonging to each cluster are represented and also the observations that provoke more indecision about its assingment are shown.

sil_mclust_X <- silhouette(Mclust_X$classification,dist(data_nume,method="manhattan"))
plot(sil_mclust_X,col=color_1)

Finally, we computed the Silhouette plot of this method to find out that this approach is not as good as the previous methods because there are more and larger negative values. This means that there are more points that are wrongly assigned and with a larger error magnitude.

5.4 Density

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an algorithm that uses distances (normally Euclidean) between observations to cluster. Two parameters have to be set, \(\epsilon\) which is a measure of closeness and \(MP\) which is the minimum number of points to form a cluster. In this case, the fixed values are \(1.05\) for \(\epsilon\) and \(10\) for \(MP\) because they present the best results.

minPts <- 10
par(mfrow=c(1,2))
kNNdistplot(p1$x[,1:2],k=minPts-1)
abline(h=1.05,lty=2,lwd=2,col=color_1)
dbscan_Z <- dbscan(p1$x[,1:2],eps=1.05,minPts=minPts)
dbscan_Z
## DBSCAN clustering for 205 objects.
## Parameters: eps = 1.05, minPts = 10
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 31 noise points.
## 
##   0   1 
##  31 174 
## 
## Available fields: cluster, eps, minPts, dist, borderPoints
colors_dbscan_Z <- c(color_1,color_2,color_3,color_4)[dbscan_Z$cluster+1]
plot(p1$x[,1:2],pch=19,col=colors_dbscan_Z,xlab="First PC",ylab="Second PC")

In the first plot we can see why \(\epsilon\) is \(1.05\) as the first elbow of the graph is in that point. In the second plot it is represented that this method with these parameters finds that the best option is to form a big central cluster and that the rest of observations (colored in blue) are noise.

sil_dbscan_X <- silhouette(dbscan_Z$cluster,dist(data_nume,method="manhattan"))
plot(sil_dbscan_X,col=color_1)

To measure the goodness of this method, we computed the Silhouette plot and the result is worse than the results from the DIANA algorithm. The average Silhouette is similar but there are several points that do not belong to the cluster that they were assigned, therefore we do not prefer the latter method.

5.5 Conclusion of the unsupervised analysis

As it was mentioned before, the best results were obtained with the DIANA method as no negative value was present in the Silhouette plot and therefore no wrong assingnation took place.

6 Supervised classification

The surpervised classification will be carried out with the fuel_type variable, which is heavily imbalanced so the results may not be reliable.

Moreover, this part of the project will be carried out only the numerical values will be taken into account because the categorical values of the 20 cars fueled by diesel are almost constant which can cause problems with the classifications based on the Bayes Theorem. In addition, for the KNN classification, the Euclidean distance will be used, and it does not work well with categorical values.

options(digits=4)

color_3 <- "seagreen2"
color_4 <- "indianred2"
        
# Define the data matrix and the response vector

X <- data_num[,!names(data_num) %in% c("Class")]
Y <- data$fuel_type

c_gas <- sum(Y=="gas")
c_diesel <- sum(Y=="diesel")
c(c_gas,c_diesel)
## [1] 185  20
n = nrow(data_num)
pr_gas <- c_gas/n
pr_diesel <- c_diesel/n
c(pr_gas,pr_diesel)
## [1] 0.90244 0.09756

6.1 Visual analysis

An initial graphical analysis is going to be done in order to identify how difficult it will be to solve the problem.

6.1.1 Parallel Coordinates Plot

colors_Y <- c(color_1,color_2)[1*(Y=="diesel")+1]
parcoord(X,col=colors_Y,var.label=F,main="PCP")

In this graph we can see that many of the variables do not allow to distinguish well the groups. However, there is one that clearly differenciates the data, which is compressio_ratio as we saw in the previous sections

X_skewness <- apply(X,2,skewness)
sort(X_skewness,decreasing=TRUE)
## compressio_ratio      engine_size            price       horsepower 
##          2.59172          1.93337          1.81838          1.36530 
##       wheel_base            width      curb_weight      highway_mpg 
##          1.04251          0.89738          0.67640          0.53604 
##           length           height         peak_rpm             bore 
##          0.15481          0.06266          0.04847         -0.02429 
##           stroke 
##         -0.63745
plot(1:ncol(X),apply(X,2,skewness),type="h",pch=19,col=color_1,xlab="Variables",
     ylab="Skewness coefficients",main="Skewness coefficients")

We can see that the largest skewness coefficient is 2.59 and the minimum is -0.63, so not all the variables are very skewed.

The variables which are skewed are: - compressio_ratio
- engine_size
- price
- horsepower

6.1.2 PCA

X_pcs <- prcomp(X,scale=TRUE)
summary(X_pcs)
## Importance of components:
##                          PC1   PC2    PC3    PC4    PC5    PC6    PC7    PC8
## Standard deviation     2.589 1.460 1.0661 0.9126 0.8354 0.6462 0.5978 0.4721
## Proportion of Variance 0.516 0.164 0.0874 0.0641 0.0537 0.0321 0.0275 0.0171
## Cumulative Proportion  0.516 0.680 0.7670 0.8311 0.8848 0.9169 0.9444 0.9615
##                           PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.4210 0.34191 0.30269 0.25779 0.21842
## Proportion of Variance 0.0136 0.00899 0.00705 0.00511 0.00367
## Cumulative Proportion  0.9752 0.98417 0.99122 0.99633 1.00000

Three components are necessary to explain around 76% of the variability and using four components will explain 83% of it.

# Make a plot of the first two PCs

plot(X_pcs$x[,1:2],pch=20,col=colors_Y,xlim=c(-4.5,5.5),ylim=c(-7,5),
     main="First two PCs")

This plot shows that there are two clearly differenciated groups with the first two components.

6.2 Methods

We divide the data into the train and test partitions.

set.seed(485)

# Obtain a training and a test sample: Take at random the 70% of the observations as the training sample and 
# the other 30% of the observations as the test sample

n_train <- floor(.7*n)
n_test <- n - n_train
c(n_train,n_test)
## [1] 143  62
# Obtain the indices of the observations in both data sets at random

i_train <- sort(sample(1:n,n_train))
length(i_train)
## [1] 143
head(i_train)
## [1] 2 3 4 5 6 7
i_test <- setdiff(1:n,i_train)
length(i_test)
## [1] 62
head(i_test)
## [1]  1  8 14 19 20 27
# Obtain the training data matrix and the associated response vector

X_train <- X[i_train,]
Y_train <- Y[i_train]


# Obtain the test data matrix and the associated response vector

X_test <- X[i_test,]
Y_test <- Y[i_test]


# Which are the proportions of diesel and gas cars in the training and test sample

sum(Y_train=="gas")/n_train
## [1] 0.8951
sum(Y_train=="diesel")/n_train
## [1] 0.1049
sum(Y_test=="gas")/n_test
## [1] 0.9194
sum(Y_test=="diesel")/n_test
## [1] 0.08065
# There similar to the original data

6.2.1 k-nearest neighbors (kNN)

The KNN method will be using the Euclidean distance, so the data will b scaled in order to obtained the best results posible. First, we will compute the LOOCV Error Rate (LER) in order to estimate the optimal value for k.

# Scale the data
scaled_X_train <- scale(X_train,center=F)
scaled_X_test <- scale(X_test,center=F)

# We take k ranging from 3 to 40 as the sample size is big enough and estimate the LOOCV error rate

LER <- rep(NA,37)
for (i in 3 : 40){
  print(i)
  knn_output <- knn.cv(scaled_X_train,Y_train,k=i)
  LER[i] <- 1 - mean(knn_output==Y_train)
}
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
## [1] 21
## [1] 22
## [1] 23
## [1] 24
## [1] 25
## [1] 26
## [1] 27
## [1] 28
## [1] 29
## [1] 30
## [1] 31
## [1] 32
## [1] 33
## [1] 34
## [1] 35
## [1] 36
## [1] 37
## [1] 38
## [1] 39
## [1] 40
plot(1:40,LER,pch=20,col=color_1,type="b",
     xlab="k",ylab="LER",main="LER")

# Take k as the one that gives the minimum LOOCV error rate

(k <- which.min(LER))
## [1] 3

The optimal values for K is 3.

# Classify the responses in the test data set with the k selected

knn_Y_test <- knn(scaled_X_train,scaled_X_test,Y_train,k=k,prob=T)
head(knn_Y_test)
## [1] gas gas gas gas gas gas
## Levels: diesel gas
# Number of cars classified in each group

summary(knn_Y_test)
## diesel    gas 
##      5     57
# Confusion table

table(Y_test,knn_Y_test)
##         knn_Y_test
## Y_test   diesel gas
##   diesel      5   0
##   gas         0  57
# Obtain the Test Error Rate (TER)

knn_TER <- mean(Y_test!=knn_Y_test)
knn_TER
## [1] 0
# Estimated probabilities of the classifications made for the winner group

prob_knn_Y_test <- attributes(knn_Y_test)$prob
head(prob_knn_Y_test)
## [1] 1 1 1 1 1 1
# Make a plot of the probabilities of the winner group
# In green, good classifications, in red, wrong classifications

colors_errors <- c(color_4,color_3)[1*(Y_test==knn_Y_test)+1]
plot(1:n_test,prob_knn_Y_test,col=colors_errors,pch=20,type="p",
     xlab="Test sample",ylab="Winning probabilities",main="Winning probabilities")

# Note that there have been some ties and that observations at the same distance to the observation 
# to be classified have been included

6.2.2 Methods based on the Bayes Theorem.

6.2.2.1 Linear Discriminant Analysis (LDA).

lda_train <- lda(Y_train~.,data=as.data.frame(X_train))

# Estimated prior probabilities for the two groups

lda_train$prior
## diesel    gas 
## 0.1049 0.8951
# Estimated sample mean vectors

t(lda_train$means)
##                     diesel       gas
## wheel_base         103.427    97.920
## length             180.493   172.629
## width               67.267    65.595
## height              55.560    53.555
## curb_weight       2841.067  2496.672
## engine_size        134.400   123.266
## bore                 3.358     3.276
## stroke               3.490     3.220
## compressio_ratio    22.073     8.836
## horsepower          83.067   104.781
## peak_rpm          4443.333  5217.188
## highway_mpg         35.400    30.312
## price            16085.867 12629.531
# Classify the observations in the test sample

lda_test <- predict(lda_train,newdata=as.data.frame(X_test))

# The vector of classifications made can be found here

lda_Y_test <- lda_test$class
lda_Y_test
##  [1] gas    gas    gas    gas    gas    gas    gas    gas    gas    gas   
## [11] gas    gas    gas    gas    gas    gas    gas    gas    gas    gas   
## [21] gas    gas    gas    gas    gas    gas    diesel diesel gas    diesel
## [31] gas    gas    gas    gas    gas    gas    gas    gas    gas    gas   
## [41] gas    gas    gas    gas    gas    gas    gas    gas    gas    gas   
## [51] gas    gas    gas    diesel gas    gas    gas    gas    gas    gas   
## [61] diesel gas   
## Levels: diesel gas
# Number of cars classified in each group

table(lda_Y_test)
## lda_Y_test
## diesel    gas 
##      5     57
# Contingency table with good and bad classifications

table(Y_test,lda_Y_test)
##         lda_Y_test
## Y_test   diesel gas
##   diesel      5   0
##   gas         0  57
# Test Error Rate (TER)

lda_TER <- mean(Y_test!=lda_Y_test)
lda_TER
## [1] 0
# Conditional probabilities of the classifications made with the test sample

prob_lda_Y_test <- lda_test$posterior
head(prob_lda_Y_test)
##        diesel gas
## 1  2.508e-170   1
## 8  5.058e-170   1
## 14 1.280e-151   1
## 19 1.538e-162   1
## 20 3.739e-160   1
## 27 3.325e-159   1
# In green, good classifications, in red, wrong classifications

colors_errors <- c(color_4,color_3)[1*(Y_test==lda_Y_test)+1]
plot(1:n_test,prob_lda_Y_test[,1],col=colors_errors,pch=20,type="p",
     xlab="Test sample",ylab="Probabilities of diesel",
     main="Probabilities of gas (LDA)")
abline(h=0.5)

6.2.2.2 Quadratic Discriminant Analysis (QDA).

qda_train <- qda(Y_train~.,data=as.data.frame(X_train))

eigen(cov(X_train[Y_train=="diesel",]))$values
##  [1] 7.338e+07 5.905e+04 1.442e+04 1.065e+02 2.971e+01 1.116e+01 8.154e+00
##  [8] 1.335e+00 4.828e-01 1.382e-01 3.418e-02 1.219e-03 7.234e-05
summary(X_train[Y_train=="diesel",])
##    wheel_base        length        width          height      curb_weight  
##  Min.   : 94.5   Min.   :165   Min.   :63.8   Min.   :52.8   Min.   :2017  
##  1st Qu.: 97.3   1st Qu.:172   1st Qu.:65.5   1st Qu.:54.7   1st Qu.:2297  
##  Median :102.4   Median :178   Median :66.5   Median :55.5   Median :2579  
##  Mean   :103.4   Mean   :180   Mean   :67.3   Mean   :55.6   Mean   :2841  
##  3rd Qu.:109.0   3rd Qu.:189   3rd Qu.:69.3   3rd Qu.:56.4   3rd Qu.:3490  
##  Max.   :115.6   Max.   :203   Max.   :71.7   Max.   :58.7   Max.   :3770  
##   engine_size       bore          stroke     compressio_ratio   horsepower   
##  Min.   : 97   Min.   :2.99   Min.   :3.35   Min.   :21.0     Min.   : 52.0  
##  1st Qu.:106   1st Qu.:3.14   1st Qu.:3.40   1st Qu.:21.5     1st Qu.: 60.0  
##  Median :122   Median :3.39   Median :3.47   Median :22.0     Median : 72.0  
##  Mean   :134   Mean   :3.36   Mean   :3.49   Mean   :22.1     Mean   : 83.1  
##  3rd Qu.:168   3rd Qu.:3.58   3rd Qu.:3.64   3rd Qu.:22.6     3rd Qu.:109.0  
##  Max.   :183   Max.   :3.70   Max.   :3.64   Max.   :23.0     Max.   :123.0  
##     peak_rpm     highway_mpg       price      
##  Min.   :4150   Min.   :25.0   Min.   : 7099  
##  1st Qu.:4350   1st Qu.:25.0   1st Qu.: 8696  
##  Median :4500   Median :36.0   Median :13845  
##  Mean   :4443   Mean   :35.4   Mean   :16086  
##  3rd Qu.:4500   3rd Qu.:42.0   3rd Qu.:21948  
##  Max.   :4800   Max.   :50.0   Max.   :31600
# Therefore, we apply qda excluding this columns of all the data sets

qda_train <- qda(Y_train ~ .,data=as.data.frame(X_train))

# Estimated unconditional probabilities

qda_train$prior
## diesel    gas 
## 0.1049 0.8951
# Estimated sample mean vectors

t(qda_train$means)
##                     diesel       gas
## wheel_base         103.427    97.920
## length             180.493   172.629
## width               67.267    65.595
## height              55.560    53.555
## curb_weight       2841.067  2496.672
## engine_size        134.400   123.266
## bore                 3.358     3.276
## stroke               3.490     3.220
## compressio_ratio    22.073     8.836
## horsepower          83.067   104.781
## peak_rpm          4443.333  5217.188
## highway_mpg         35.400    30.312
## price            16085.867 12629.531
# Classify the observations in the test sample

qda_test <- predict(qda_train,newdata=as.data.frame(X_test))

# The vector of classifications made can be found here

qda_Y_test <- qda_test$class
qda_Y_test
##  [1] gas    gas    gas    gas    gas    gas    gas    gas    gas    gas   
## [11] gas    gas    gas    gas    gas    gas    gas    gas    gas    gas   
## [21] gas    gas    gas    gas    gas    gas    diesel diesel gas    diesel
## [31] gas    gas    gas    gas    gas    gas    gas    gas    gas    gas   
## [41] gas    gas    gas    gas    gas    gas    gas    gas    gas    gas   
## [51] gas    gas    gas    diesel gas    gas    gas    gas    gas    gas   
## [61] diesel gas   
## Levels: diesel gas
# Number of cars classified in each group

table(qda_Y_test)
## qda_Y_test
## diesel    gas 
##      5     57
# Contingency table with good and bad classifications

table(Y_test,qda_Y_test)
##         qda_Y_test
## Y_test   diesel gas
##   diesel      5   0
##   gas         0  57
# Test Error Rate (TER)

qda_TER <- mean(Y_test!=qda_Y_test)
qda_TER
## [1] 0
# Conditional probabilities of the classifications made with the test sample

prob_qda_Y_test <- qda_test$posterior
head(prob_qda_Y_test)
##    diesel gas
## 1       0   1
## 8       0   1
## 14      0   1
## 19      0   1
## 20      0   1
## 27      0   1
# In green, good classifications, in red, wrong classifications

plot(1:n_test,prob_qda_Y_test[,1],col=colors_errors,pch=20,type="p",
     xlab="Test sample",ylab="Probabilities of gas",
     main="Probabilities of gas (QDA)")
abline(h=0.5)

6.2.2.3 Naive Bayes (NB).

nb_train <- gaussian_naive_bayes(X_train,Y_train)

# Estimated unconditional probabilities

nb_train$prior
## diesel    gas 
## 0.1049 0.8951
# Estimated sample mean vectors

t(nb_train$params$mu)
##                     diesel       gas
## wheel_base         103.427    97.920
## length             180.493   172.629
## width               67.267    65.595
## height              55.560    53.555
## curb_weight       2841.067  2496.672
## engine_size        134.400   123.266
## bore                 3.358     3.276
## stroke               3.490     3.220
## compressio_ratio    22.073     8.836
## horsepower          83.067   104.781
## peak_rpm          4443.333  5217.188
## highway_mpg         35.400    30.312
## price            16085.867 12629.531
t(nb_train$params$sd)
##                     diesel       gas
## wheel_base          7.0171    5.3508
## length             11.8971   12.0634
## width               2.4956    1.9380
## height              1.6949    2.4858
## curb_weight       624.5451  482.9014
## engine_size        35.0404   40.8650
## bore                0.2618    0.2784
## stroke              0.1218    0.3194
## compressio_ratio    0.7156    0.7199
## horsepower         27.9322   40.2105
## peak_rpm          203.4231  425.2113
## highway_mpg         8.9427    6.6104
## price            8545.4190 7775.8375
# Classify the observations in the test sample

nb_test <- predict(nb_train,newdata=as.matrix(X_test),type="prob")

# The vector of classifications made can be found here

nb_Y_test <- c("diesel","gas")[apply(nb_test,1,which.max)]
nb_Y_test
##  [1] "gas"    "gas"    "gas"    "gas"    "gas"    "gas"    "gas"    "gas"   
##  [9] "gas"    "gas"    "gas"    "gas"    "gas"    "gas"    "gas"    "gas"   
## [17] "gas"    "gas"    "gas"    "gas"    "gas"    "gas"    "gas"    "gas"   
## [25] "gas"    "gas"    "diesel" "diesel" "gas"    "diesel" "gas"    "gas"   
## [33] "gas"    "gas"    "gas"    "gas"    "gas"    "gas"    "gas"    "gas"   
## [41] "gas"    "gas"    "gas"    "gas"    "gas"    "gas"    "gas"    "gas"   
## [49] "gas"    "gas"    "gas"    "gas"    "gas"    "diesel" "gas"    "gas"   
## [57] "gas"    "gas"    "gas"    "gas"    "diesel" "gas"
table(nb_Y_test)
## nb_Y_test
## diesel    gas 
##      5     57
# Contingency table with good and bad classifications

table(Y_test,nb_Y_test)
##         nb_Y_test
## Y_test   diesel gas
##   diesel      5   0
##   gas         0  57
# Test Error Rate (TER)

nb_TER <- mean(Y_test!=nb_Y_test)
nb_TER
## [1] 0
# Conditional probabilities of the classifications made with the test sample

prob_nb_Y_test <- nb_test
head(prob_nb_Y_test)
##         diesel gas
## [1,] 9.784e-88   1
## [2,] 1.183e-82   1
## [3,] 3.010e-74   1
## [4,] 6.092e-74   1
## [5,] 2.194e-75   1
## [6,] 1.390e-78   1
# In green, good classifications, in red, wrong classifications

plot(1:n_test,prob_nb_Y_test[,1],col=colors_errors,pch=20,type="p",
     xlab="Test sample",ylab="Probabilities of gas",
     main="Probabilities of gas (Naive Bayes)")
abline(h=0.5)

6.2.3 Logistic regression (LR).

lr_train <- multinom(Y_train~.,data=as.data.frame(X_train))
## # weights:  15 (14 variable)
## initial  value 99.120047 
## iter  10 value 23.259049
## iter  20 value 0.173415
## iter  30 value 0.000495
## iter  30 value 0.000017
## final  value 0.000017 
## converged
# Have a look at the estimated coefficients and their standard errors

summary(lr_train)
## Call:
## multinom(formula = Y_train ~ ., data = as.data.frame(X_train))
## 
## Coefficients:
##                      Values Std. Err.
## (Intercept)       1.017e+00    0.2273
## wheel_base        7.134e-01   30.0466
## length           -1.196e-01   26.9029
## width             7.348e-01   14.4214
## height           -7.451e-01   25.1233
## curb_weight      -1.545e-03    1.1909
## engine_size       1.484e-02   36.6575
## bore             -2.301e+00    0.5050
## stroke           -4.251e+00    1.2341
## compressio_ratio -3.047e+00   14.3096
## horsepower        5.409e-02   31.0416
## peak_rpm         -7.352e-04    1.2466
## highway_mpg       3.804e-01   12.1366
## price            -8.715e-05    0.1303
## 
## Residual Deviance: 3.348e-05 
## AIC: 28
# To see which are the most significant coefficients, we use the t-test that is computed next

(t_test_lr_train <- summary(lr_train)$coefficients/summary(lr_train)$standard.errors)
##      (Intercept)       wheel_base           length            width 
##        4.4723754        0.0237417       -0.0044456        0.0509520 
##           height      curb_weight      engine_size             bore 
##       -0.0296563       -0.0012971        0.0004049       -4.5568640 
##           stroke compressio_ratio       horsepower         peak_rpm 
##       -3.4445828       -0.2129239        0.0017426       -0.0005897 
##      highway_mpg            price 
##        0.0313466       -0.0006688
# Sort the absolute value of the t-test in decreasing order

sort(abs(t_test_lr_train),decreasing=TRUE)
##             bore      (Intercept)           stroke compressio_ratio 
##        4.5568640        4.4723754        3.4445828        0.2129239 
##            width      highway_mpg           height       wheel_base 
##        0.0509520        0.0313466        0.0296563        0.0237417 
##           length       horsepower      curb_weight            price 
##        0.0044456        0.0017426        0.0012971        0.0006688 
##         peak_rpm      engine_size 
##        0.0005897        0.0004049

The most relevant variables seem to be the bore and stroke.

There are a few variables, which are not significant such as the engine_size and the price.

# Classify the responses in the test data set and estimate the test error rate

lr_test <- predict(lr_train,newdata=as.data.frame(X_test))
head(lr_test)
## [1] gas gas gas gas gas gas
## Levels: diesel gas
# Number of cars classified in each group

summary(lr_test)
## diesel    gas 
##      5     57
# Table with good and bad classifications

table(Y_test,lr_test)
##         lr_test
## Y_test   diesel gas
##   diesel      5   0
##   gas         0  57
# Obtain the Test Error Rate (TER)

lr_TER <- mean(Y_test!=lr_test)
lr_TER
## [1] 0
# Obtain the probabilities of gas

prob_lr_test <- 1 - predict(lr_train,newdata=X_test,type ="probs")
head(prob_lr_test)
##  1  8 14 19 20 27 
##  0  0  0  0  0  0
# In green, good classifications, in red, wrong classifications

plot(1:n_test,prob_lr_test,col=colors_errors,pch=20,type="p",
     xlab="Test sample",ylab="Probabilities of gas",
     main="Probabilities of gas (Logistic Regression)")
abline(h=0.5)

6.3 Conclusions

Looking at the obtained results with the supervised classification, all the methods classify perfectly the observations of the test sample. Therefore, there would be no difference in using one method or another.

However, these errors are not reliable as the response variable that has been used, fuel_type, makes the data heavily imbalanced. Moreover, the amount of observations of the diesel cars is only 20, so there are not sufficient cars in order to train and test the model correctly. This problem could have been solved with oversampling but in this case, it did not work well, which would be due to the homogeneity of these cars.